Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/10/26 02:36:30 UTC

[Nutch Wiki] Trivial Update of "NutchHadoopTutorial" by RobHunter

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopTutorial" page has been changed by RobHunter.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=25&rev2=26

--------------------------------------------------

  
  Each line contains a machine name and port that represents a search server.  This tells the website to connect to search servers on those machines at those ports.
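For example, a search-servers.txt file for two search servers might look like the following (the hostnames and port are illustrative; use your own machine names and whatever port you start each server on):

```
searchserver1 9999
searchserver2 9999
```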
  
- On each of the search servers, since we are searching local directories to search, you would need to make sure that the filesystem in the nutch-site.xml file is pointing to local.  One of the problems that I can across is that I was using the same nutch distribution to act as a slave node for DFS and MR as I was using to run the distributed search server.  The problem with this was that when the distributed search server started up it was looking in the DFS for the files to read.  It couldn't find them and I would get log messages saying x servers with 0 segments.  
+ On each of the search servers, since we are searching local directories, you would need to make sure that the filesystem in the nutch-site.xml file is pointing to local.  One of the problems that I came across is that I was using the same nutch distribution to act as a slave node for DFS and MR as I was using to run the distributed search server.  The problem with this was that when the distributed search server started up it was looking in the DFS for the files to read.  It couldn't find them and I would get log messages saying x servers with 0 segments.  
  
  I found it easiest to create another nutch distribution in a separate folder.  I would then start the distributed search server from this separate distribution.  I just used the default nutch-site.xml and hadoop-site.xml files which have no configuration.  This defaults the filesystem to local and the distributed search server is able to find the files it needs on the local box.  
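If you do set the filesystem explicitly rather than relying on the empty default config, the relevant setting is the default filesystem property. A minimal sketch, assuming the property name used by Hadoop in this era (check your version's hadoop-default.xml for the exact name and value):

```xml
<!-- hadoop-site.xml (or nutch-site.xml) on a search server.
     "local" tells the distributed search server to read index files
     from the local filesystem instead of DFS. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
</configuration>
```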
  
@@ -617, +617 @@

  
  The arguments are the port to start the server on, which must correspond to what you put into the search-servers.txt file, and the local directory that is the parent of the index folder.  Once the distributed search servers are started on each machine you can start up the website.  Searching should then happen normally, with the exception that search results are pulled from the distributed search server indexes.  In the logs on the search website (usually the catalina.out file), you should see messages telling you the number of servers and segments the website is attached to and searching.  This will let you know whether your setup is correct.
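As a concrete sketch of the startup command described above (the port and directory are illustrative, and the exact `server` subcommand should be checked against your Nutch version's `bin/nutch` usage message):

```shell
# Start a distributed search server on port 9999, serving the crawl data
# under /home/nutch/local-index (the parent of the "index" folder).
# The port must match this host's entry in search-servers.txt on the website.
bin/nutch server 9999 /home/nutch/local-index
```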
  
- There is no command to shutdown the distributed search server process, you will simply have to kill it by hand.  The good news is that the website polls the servers in its search-servers.txt file to constantly check if they are up so you can shut down a single distributed search server, change out its index and bring it back up and the website will reconnect automatically.  This was they entire search is never down at any one point in time, only specific parts of the index would be down.
+ There is no command to shut down the distributed search server process; you will simply have to kill it by hand.  The good news is that the website polls the servers in its search-servers.txt file to constantly check whether they are up, so you can shut down a single distributed search server, change out its index, and bring it back up, and the website will reconnect automatically.  This way the entire search is never down at any one point in time; only specific parts of the index would be down.
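Killing the server by hand can be done with standard process tools. A sketch, assuming the server's JVM is identifiable by its main class name (verify the actual class name in your version's ps output before relying on the grep pattern):

```shell
# Find the distributed search server's JVM on this machine.
ps aux | grep DistributedSearch | grep -v grep

# Kill it using the PID from the listing above; once you restart the
# server, the website will reconnect to it automatically.
kill <pid-from-the-listing>
```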
  
  In a production environment searching is the biggest cost, both in machines and electricity.  The reason is that once an index piece gets beyond about 2 million pages it takes too much time to read from the disk, so you cannot have a 100 million page index on a single machine no matter how big the hard disk is.  Fortunately, using distributed searching you can have multiple dedicated search servers, each with its own piece of the index, that are searched in parallel.  This allows very large index systems to be searched efficiently.