
Posted to dev@nutch.apache.org by "Kelvin Tan (JIRA)" <ji...@apache.org> on 2005/08/25 01:06:09 UTC

[jira] Updated: (NUTCH-84) Fetcher for constrained crawls

     [ http://issues.apache.org/jira/browse/NUTCH-84?page=all ]

Kelvin Tan updated NUTCH-84:
----------------------------

    Attachment: oc-0.3.zip

Javadocs included in the zip and also available online at http://www.supermind.org/code/oc/api/index.html.

Code is released under APL, but I've also included the Spring jars you'll need to run it.

> Fetcher for constrained crawls
> ------------------------------
>
>          Key: NUTCH-84
>          URL: http://issues.apache.org/jira/browse/NUTCH-84
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Kelvin Tan
>     Priority: Minor
>  Attachments: oc-0.3.zip
>
> As posted http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2


Re: (NUTCH-84) Fetcher for constrained crawls

Posted by Kelvin Tan <ke...@relevanz.com>.

Instructions for running:

1. Edit build.properties so that it points to your Nutch installation.

2. Run: ant nutch-deploy
Ant copies the relevant jars to nutch_home/lib, and beans.xml to nutch_home/conf.

3. Edit nutch_home/conf/beans.xml (the Spring framework conf file)
The important values to change are the ones involving file paths. In particular, change the location of the seed file for the crawl. The seed file is Nutch-style: one URL per line.

Also look at SizeConstrainedFLFilter. It limits the size of the crawl to the number you set there (great for test runs, but not so hot for whole-web crawls).

4. Fire up cygwin or bash.
Go to nutch home, and run
./nutch org.supermind.crawl.CrawlTool

This should start the crawler (and hopefully it'll run till completion!)
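
To illustrate step 3, here is a minimal sketch of the two ideas mentioned there: a seed list with one URL per line, and a cap on the number of URLs a crawl will accept. It is not taken from oc-0.3.zip. The class name SeedListSketch, the method parseSeeds and the nested SizeCap class are all hypothetical, and the real SizeConstrainedFLFilter may well work differently; see the Javadocs for the actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: every name in this file is hypothetical and is
// NOT part of the oc-0.3 code or of Nutch.
public class SeedListSketch {

    /** Parses seed text holding one URL per line; blank lines are skipped. */
    public static List<String> parseSeeds(String text) {
        List<String> seeds = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String url = line.trim();
            if (!url.isEmpty()) {
                seeds.add(url);
            }
        }
        return seeds;
    }

    /** Assumed behaviour of a size cap: accept URLs until maxUrls is reached. */
    public static final class SizeCap {
        private final int maxUrls;
        private int accepted = 0;

        public SizeCap(int maxUrls) {
            this.maxUrls = maxUrls;
        }

        /** Returns true while the cap has not been reached, false afterwards. */
        public boolean accept(String url) {
            if (accepted >= maxUrls) {
                return false;
            }
            accepted++;
            return true;
        }
    }

    public static void main(String[] args) {
        // Example seed text; the URLs are placeholders.
        String seedText = "http://a.example/\n\nhttp://b.example/\nhttp://c.example/\n";
        SizeCap cap = new SizeCap(2); // small cap, as you might use for a test run
        for (String url : parseSeeds(seedText)) {
            System.out.println((cap.accept(url) ? "fetch " : "skip  ") + url);
        }
    }
}
```

A small cap like this is handy for checking that the paths in beans.xml are right before attempting a larger crawl.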

For the next _week_ or so, it's OK to mail me privately if you need help getting things up and running: kelvin at supermind dot org.

Javadocs included in the zip and also available online at http://www.supermind.org/code/oc/api/index.html.

Again, I'd like to emphasize the beta nature of the code, so please be forgiving.

Cheers,
k

On Thu, 25 Aug 2005 01:06:09 +0200 (CEST), Kelvin Tan (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-84?page=all ]
>
> Kelvin Tan updated NUTCH-84:
> ----------------------------
>
> Attachment: oc-0.3.zip
>
> Javadocs included in the zip and also available online at
> http://www.supermind.org/code/oc/api/index.html.
>
> Code is released under APL, but I've also included the Spring jars
> you'll need to run it.
>
>> Fetcher for constrained crawls
>> ------------------------------
>>
>> Key: NUTCH-84
>> URL: http://issues.apache.org/jira/browse/NUTCH-84
>> Project: Nutch
>> Type: Improvement
>> Components: fetcher
>> Versions: 0.7
>> Reporter: Kelvin Tan
>> Priority: Minor
>> Attachments: oc-0.3.zip
>>
>> As posted http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2