Posted to user@nutch.apache.org by hzhong <he...@gmail.com> on 2007/05/09 19:52:52 UTC

Nutch Crawl

Hello,

I currently have Nutch running on Hadoop.  However, for one specific crawl,
I would like to store the data on a local machine instead of putting it on
Hadoop.

I basically modified Crawl.java to change the filesystem to local:
Configuration conf = NutchConfiguration.create();
conf.addDefaultResource("crawl-tool.xml");
FileSystem localFs = FileSystem.getNamed("local", conf);
JobConf job = new NutchJob(localFs.getConf());

Path dir = new Path(some_local_path_on_the_machine);
Path crawlDb = new Path(dir + "/crawldb");
Path linkDb = new Path(dir + "/linkdb");
Path segments = new Path(dir + "/segments");
Path indexes = new Path(dir + "/indexes");
Path index = new Path(dir + "/index");
Path rootURL = new Path(local_path_on_the_machine);

Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);
LinkDb linkDbTool = new LinkDb(conf);
Indexer indexer = new Indexer(conf);
DeleteDuplicates dedup = new DeleteDuplicates(conf);
IndexMerger merger = new IndexMerger(conf);
                	                
// initialize crawlDb
injector.inject(crawlDb, rootURL);
and so on... 
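
One thing that may matter: the tools above are constructed with conf, not
with localFs, so conf itself presumably has to point at the local
filesystem.  A minimal sketch of forcing that (assuming the same Hadoop
0.x API as FileSystem.getNamed; "local" is the stock value from
hadoop-default.xml):

Configuration conf = NutchConfiguration.create();
conf.addDefaultResource("crawl-tool.xml");
// Point Hadoop itself at the local filesystem and the in-process job
// runner, so nothing attempts a namenode or jobtracker connection.
conf.set("fs.default.name", "local");
conf.set("mapred.job.tracker", "local");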

I keep getting:
Injector: starting
Injector: crawlDb: crawl_db path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Connection refused

or 

Injector: starting
Injector: crawlDb: crawldb path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Input path doesnt exist : url path

However, the url path does exist.  
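
For what it's worth, a quick way to confirm which filesystem conf actually
resolves to (a sketch, using the same old Hadoop API as the code above):

FileSystem fs = FileSystem.get(conf);
// Prints "local" for the local filesystem, or host:port of the namenode.
System.out.println("default fs: " + fs.getName());

If this prints the namenode address, both errors would make sense:
"Connection refused" when the namenode is unreachable, and "Input path
doesnt exist" because the url dir is then looked up on DFS instead of on
the local disk.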

Can someone give me pointers as to what's going on, or on how to store the
data on a local machine?  I am not sure this is the correct way of putting
the data on the local machine.

Thank you very much.

Hanna


Re: Nutch Crawl

Posted by Espen Amble Kolstad <es...@trank.no>.
Just do a normal crawl in Hadoop and use:

bin/hadoop dfs -get crawldir local_path

to store it on the local filesystem after the crawl is done.
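
For example (the paths here are only placeholders):

bin/nutch crawl urls -dir crawl -depth 3
bin/hadoop dfs -get crawl /local/path/crawl

This copies the whole crawl directory (crawldb, linkdb, segments, indexes,
index) out of DFS in one step.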

- Espen
