Posted to user@nutch.apache.org by hzhong <he...@gmail.com> on 2007/05/09 19:52:52 UTC
Nutch Crawl
Hello,
I currently have Nutch running on Hadoop. However, for one specific crawl
I would like to store the data on a local machine instead of in HDFS.
I modified Crawl.java to change the filesystem to local:
Configuration conf = NutchConfiguration.create();
conf.addDefaultResource("crawl-tool.xml");
FileSystem localFs = FileSystem.getNamed("local", conf);
JobConf job = new NutchJob(localFs.getConf());

Path dir = new Path(some_local_path_on_the_machine);
Path crawlDb = new Path(dir + "/crawldb");
Path linkDb = new Path(dir + "/linkdb");
Path segments = new Path(dir + "/segments");
Path indexes = new Path(dir + "/indexes");
Path index = new Path(dir + "/index");
Path rootURL = new Path(local_path_on_the_machine);

Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);
LinkDb linkDbTool = new LinkDb(conf);
Indexer indexer = new Indexer(conf);
DeleteDuplicates dedup = new DeleteDuplicates(conf);
IndexMerger merger = new IndexMerger(conf);

// initialize crawlDb
injector.inject(crawlDb, rootURL);
and so on...
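For comparison, one variant of the setup above overrides the default filesystem key before any tool is constructed, so that every tool resolves paths locally rather than against the cluster. This is only a sketch under the assumption that the Hadoop 0.x-era Configuration API (conf.set, FileSystem.get) is in use, as in the snippet above:

```java
// Sketch (assumes the Hadoop 0.x-era API used above): point the default
// filesystem at the local disk before constructing any of the tools.
Configuration conf = NutchConfiguration.create();
conf.addDefaultResource("crawl-tool.xml");
conf.set("fs.default.name", "local");  // "file:///" on later Hadoop versions
FileSystem fs = FileSystem.get(conf);  // now resolves to the local filesystem

// Tools built from this conf inherit the local filesystem setting.
Injector injector = new Injector(conf);
```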
I keep getting
Injector: starting
Injector: crawlDb: crawl_db path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Connection refused
or
Injector: starting
Injector: crawlDb: crawldb path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Input path doesnt exist : url path
However, the url path does exist.
Can someone give me pointers as to what is going on, or on how to store the
data on a local machine? I am not sure this is the correct way to put the
data there.
Thank you very much.
Hanna
Re: Nutch Crawl
Posted by Espen Amble Kolstad <es...@trank.no>.
Just do a normal crawl in Hadoop and then use:
bin/hadoop dfs -get crawldir local_path
to copy the crawl to the local filesystem once it is done.
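Spelled out with placeholder names (the urls dir, the crawl dir, and the local destination are examples, not fixed paths), the sequence would look something like:

```shell
# Run the crawl against HDFS as usual, then copy the finished crawl
# directory down to local disk. All paths here are placeholders.
bin/nutch crawl urls -dir crawl -depth 3
bin/hadoop dfs -get crawl /data/crawl-copy
```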
- Espen