Posted to user@nutch.apache.org by Carl Dorestos <ca...@gmail.com> on 2006/04/13 03:12:09 UTC

Help needed - how to import local files into Nutch 0.8?

I need to index hundreds of GBs of documents that I already have on a
local filesystem at my site. I need the content and the index to be
distributed on DFS for distributed search.

What is the best way to import these files (all HTML docs) into Nutch
0.8 using DFS and MapReduce?

I tried putting the files on an HTTP server at my site, then crawling the
files from my DFS/MapReduce Nutch cluster.
- The servers are connected by 1 Gbit/s Ethernet, but I could only get a
crawl bandwidth of 200 kb/s.
- It is not a CPU utilization issue. I checked the CPU utilization on the
slaves, and it was low, as expected (5%-10%).
- The crawl doesn't go through a firewall.
- The crawl-urlfilter.txt file is very simple, with only a few lines.
- Is it a politeness issue? If so, how do I override the politeness settings?

I'd appreciate your help.

Carl

Re: Help needed - how to import local files into Nutch 0.8?

Posted by Doug Cutting <cu...@apache.org>.
Carl Dorestos wrote:
> - Is it a politeness issue? If so how to override the politeness settings?

To disable politeness, you would change fetcher.server.delay to zero and 
fetcher.threads.per.host to something larger than fetcher.threads.fetch.
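Concretely, these overrides would go in conf/nutch-site.xml on the fetcher nodes. A minimal sketch follows; the specific values are illustrative assumptions, not recommendations:

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml -->
<configuration>
  <!-- No pause between successive requests to the same host -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>
  <!-- Allow more threads per host than total fetch threads,
       so a single server never throttles the crawl -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
</configuration>
```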

Doug

Re: Help needed - how to import local files into Nutch 0.8?

Posted by sudhendra seshachala <su...@yahoo.com>.
Please refer to http://www.mail-archive.com/nutch-user@lucene.apache.org/msg04056.html

I hope you find it useful. Just follow every instruction there.

Let me know if you need anything else.
Thanks,
Sudhi
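For the archive: the usual way to crawl local files in Nutch 0.8 (possibly what the linked message describes; the exact plugin list below is an assumption based on the defaults) is to enable the protocol-file plugin in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: add protocol-file to the enabled plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```

The seed list would then contain file: URLs pointing at the local document directories, and crawl-urlfilter.txt would need to be edited to accept file: URLs, since the default filter skips them.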

  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		
---------------------------------
How low will we go? Check out Yahoo! Messenger’s low  PC-to-Phone call rates.