Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/08/24 19:13:47 UTC

[Nutch Wiki] Update of "NutchHadoopTutorial" by DennisKubes

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NutchHadoopTutorial

The comment on the change is:
Fixed searching and added distributed searching instructions

------------------------------------------------------------------------------
    /search
      (nutch installation goes here)
    /filesystem
+   /local (local directory used for searching)
    /home
      (nutch user's home directory)
    /tomcat    (only on one server for searching)
@@ -132, +133 @@

  mkdir /nutch
  mkdir /nutch/search
  mkdir /nutch/filesystem
+ mkdir /nutch/local
  mkdir /nutch/home
  
  groupadd users
@@ -454, +456 @@

  You can also start up new terminals on the slave machines and tail the log files to see detailed output for each slave node.  The crawl will probably take a while to complete.  When it is done we are ready to do the search.
  
  
- == Performing a Search with Hadoop and Nutch ==
+ == Performing a Search ==
  --------------------------------------------------------------------------------
+ To perform a search on the index we just created within the distributed filesystem we need to do two things.  First, we need to pull the index to a local filesystem, and second, we need to set up and configure the nutch war file.  Although technically possible, it is not advisable to search directly against the distributed filesystem.
+ 
+ The DFS is great for holding the results of the MapReduce processes, including the completed index, but searching against it simply takes too long.  In a production system you will want to create the indexes using the MapReduce system and store the result on the DFS, and then copy those indexes to a local filesystem for searching.  If the index is too big (say, a 100 million page index), you will want to break it up into multiple pieces of 1-2 million pages each, copy the pieces from the DFS to local filesystems, and have multiple search servers read from those local index pieces.  A full distributed search setup is the topic of another tutorial, but for now realize that you don't want to search using the DFS; you want to search using local filesystems.
+ 
+ Once the index has been created on the DFS you can use the hadoop copyToLocal command to copy it to the local filesystem, as follows.
+ 
+ {{{
+ bin/hadoop dfs -copyToLocal crawled /d01/local/
+ }}}
+ 
+ Your crawl directory should contain an index directory which holds the actual index files.  Later, when working with Nutch and Hadoop, if you instead have an indexes directory with folders such as part-xxxxx inside of it, you can use the nutch merge command to merge those segment indexes into a single index (see the example below).  When pointed at the local filesystem, the search website looks for a directory containing either an index folder with merged index files or an indexes folder with segment indexes.  This can be a tricky part: your search website can be working properly, but if it doesn't find the indexes, all searches will return nothing.
+ 
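+ As a rough sketch of that merge step, assuming the crawled directory has been copied to /d01/local as shown above and only contains segment indexes, the command would look something like this (the output index directory should not already exist, and the exact arguments of bin/nutch merge may vary between Nutch versions):
+ 
+ {{{
+ # merge the part-xxxxx segment indexes under indexes/ into a single merged index
+ bin/nutch merge /d01/local/crawled/index /d01/local/crawled/indexes
+ }}}
+ 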
- To perform a search on the index we just created within the distributed filesystem we first need to setup and configure the nutch war file.  If you setup the tomcat server as we stated earlier then you should have a tomcat installation under /nutch/tomcat and in the webapps directory you should have a folder called ROOT with the nutch war file unzipped inside of it.  Now we just need to configure the application to use the distributed filesystem for searching.  We do this by editing the hadoop-site.xml file under the WEB-INF/classes directory.  Use the following commands:
+ If you set up the tomcat server as described earlier, you should have a tomcat installation under /nutch/tomcat, and in the webapps directory you should have a folder called ROOT with the nutch war file unzipped inside of it.  Now we just need to configure the application to search the local index.  We do this by editing the nutch-site.xml file under the WEB-INF/classes directory.  Use the following commands:
  
  {{{
  cd /nutch/tomcat/webapps/ROOT/WEB-INF/classes
- vi hadoop-site.xml
+ vi nutch-site.xml
  }}}
  
- Below is an template hadoop-site.xml file:
+ Below is a template nutch-site.xml file:
  
  {{{
  <?xml version="1.0"?>
@@ -473, +487 @@

  
    <property>
      <name>fs.default.name</name>
-     <value>devcluster01:9000</value>
+     <value>local</value>
    </property>
  
    <property>
      <name>searcher.dir</name>
-     <value>crawled</value>
+     <value>/d01/local/crawled</value>
    </property>
  
  </configuration>
  }}}
  
- The fs.default.name property as before is pointed to our name node.
+ The fs.default.name property is now set to local for searching the local index.  Understand that at this point we are not using the DFS or MapReduce to do the searching; all of it happens on a local machine.
  
- The searcher.dir directory is the directory that we specified in the distributed filesystem under which the index was stored.  In our crawl command earlier we used the crawled directory.
+ The searcher.dir property points to the directory on the local filesystem where the index and the databases produced by the crawl are stored.  In our crawl command earlier we used the crawled directory, which stored the results in crawled on the DFS.  We then copied the crawled folder to the /d01/local directory on the local filesystem, so here we point this property to /d01/local/crawled.  The directory it points to should contain not just the index directory but also the linkdb, segments, etc., because all of these databases are used by the search.  This is why we copied over the entire crawled directory and not just the index directory.
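+ 
+ As a quick sanity check, listing the local copy should show something like the following (depending on how the index was built, an indexes directory with part-xxxxx folders may be present as well or instead):
+ 
+ {{{
+ ls /d01/local/crawled
+ crawldb  index  linkdb  segments
+ }}}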
  
- Once the hadoop-site.xml file is edited then the application should be ready to go.  You can start tomcat with the following command:
+ Once the nutch-site.xml file is edited, the application should be ready to go.  You can start tomcat with the following commands:
  
  {{{
  cd /nutch/tomcat
  bin/startup.sh
  }}}
  
- Then point you browser to http://devcluster01:8080 (your master node) to see the Nutch search web application.  If everything has been configured correctly then you should be able to enter queries and get results.
+ Then point your browser to http://devcluster01:8080 (your search server) to see the Nutch search web application.  If everything has been configured correctly then you should be able to enter queries and get results.  If the website is working but you are getting no results, it is most likely because the index directory is not being found.  The searcher.dir property must point to the parent of the index directory, and that parent must also contain the segments, linkdb, and crawldb folders from the crawl.  Either the index folder is named index and contains a merged index, meaning the index files sit directly inside index rather than in subdirectories named part-xxxxx, or the folder is named indexes and contains segment indexes in part-xxxxx subdirectories which hold the index files.  I have had better luck with merged indexes than with segment indexes.
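+ 
+ To make the two accepted layouts concrete, here is a sketch of what the directory pointed to by searcher.dir might look like in each case (the part-xxxxx names are only illustrative):
+ 
+ {{{
+ # merged index layout
+ /d01/local/crawled/crawldb
+ /d01/local/crawled/linkdb
+ /d01/local/crawled/segments
+ /d01/local/crawled/index           (merged index files directly in here)
+ 
+ # segment index layout
+ /d01/local/crawled/crawldb
+ /d01/local/crawled/linkdb
+ /d01/local/crawled/segments
+ /d01/local/crawled/indexes/part-00000
+ /d01/local/crawled/indexes/part-00001
+ }}}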
  
+ == Distributed Searching ==
+ --------------------------------------------------------------------------------
+ Although not really the topic of this tutorial, distributed searching needs to be addressed.  In a production system, you would create your indexes and corresponding databases (i.e. crawldb) using the DFS and MapReduce, but you would search them using local filesystems on dedicated search servers for speed and to avoid network overhead.
+ 
+ Briefly, here is how you would set up distributed searching.  In the nutch-site.xml file inside the tomcat WEB-INF/classes directory, you would point the searcher.dir property to a directory that contains a search-servers.txt file.  The search-servers.txt file would look like this:
+ 
+ {{{
+ devcluster01 1234
+ devcluster01 5678
+ devcluster02 9101
+ }}}
+ 
+ Each line contains a machine name and port that represents a search server.  This tells the website to connect to search servers on those machines at those ports.
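+ 
+ For example, if the search-servers.txt file above were placed in /d01/local (a path chosen here purely for illustration), the searcher.dir property in the website's nutch-site.xml would point at that directory, something like:
+ 
+ {{{
+   <property>
+     <name>searcher.dir</name>
+     <value>/d01/local</value>
+   </property>
+ }}}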
+ 
+ On each of the search servers you would then start up the distributed search server process using the nutch server command like this:
+ 
+ {{{
+ bin/nutch server 1234 /d01/local/crawled
+ }}}
+ 
+ The arguments are the port to start the server on, which must correspond to what you put into the search-servers.txt file, and the local directory that is the parent of the index folder.  Once the distributed search servers are started on each machine you can start up the website.  Searching should then happen normally, except that search results are pulled from the distributed search servers' indexes.
+ 
+ There is no command to shut down the distributed search server process; you simply have to kill it by hand (see the sketch below).  The tomcat logs for the website should show how many servers and segments it is connected to at any one time.  The good news is that the website constantly polls the servers in its search-servers.txt file to check whether they are up, so you can shut down a single distributed search server, change out its index, and bring it back up, and the website will reconnect automatically.  This way the entire search is never down at any one point in time; only specific parts of the index would be down.
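+ 
+ Killing the process by hand is just a matter of finding the java process for the search server and sending it a kill signal, roughly like this (the pid shown is hypothetical):
+ 
+ {{{
+ # find the java process running the distributed search server
+ ps aux | grep nutch
+ # kill it using the pid from the listing above
+ kill 28004
+ }}}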
+ 
+ In a production environment searching is the biggest cost, both in machines and electricity.  The reason is that once an index piece gets beyond about 2 million pages it takes too much time to read from the disk, so you cannot serve a 100 million page index from a single machine no matter how big the hard disk is.  Fortunately, with distributed searching you can have multiple dedicated search servers, each with their own piece of the index, searched in parallel.  This allows very large index systems to be searched efficiently.
+ 
+ Doing the math, a 100 million page system split into index pieces of roughly 2 million pages each works out to about 50 pieces, so it would take about 50 dedicated search servers to serve 20+ queries per second.  One way to get around having so many machines is to use multi-processor machines with multiple disks, running multiple search servers each using a separate disk and index.  Going down this route you can cut machine costs by as much as 50% and electricity costs by as much as 75%.  A multi-disk machine can't handle the same number of queries per second as a dedicated single-disk machine, but the number of index pages it can handle is significantly greater, so it averages out to be much more efficient.
  
  == Rsyncing Code to Slaves ==
  --------------------------------------------------------------------------------
@@ -549, +590 @@

  
  If you have any comments or suggestions feel free to email them to me at nutch-dev@dragonflymc.com.  If you have questions about Nutch or Hadoop they should be addressed to their respective mailing lists.  Below are general resources that are helpful with operating and developing Nutch and Hadoop.
  
+ == Updates ==
+ --------------------------------------------------------------------------------
+  * I don't use rsync to sync code between the servers any more.  Now I am using expect scripts and python scripts to manage and automate the system.
+ 
+  * I use distributed searching with 1-2 million pages per index piece.  We now have servers with multiple processors and multiple disks (4 per machine) running multiple search servers (1 per disk) to decrease cost and power requirements.  With this setup a single server holding 8 million pages can serve a constant 10 queries per second.
+ 
+ 
  == Resources ==
  --------------------------------------------------------------------------------
  Google MapReduce Paper:
@@ -583, +631 @@

    * When you first start up hadoop, there's a warning in the namenode log, "dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist" - You can ignore that.
    * If you get errors like, "failed to create file [...] on client [foo] because target-length is 0, below MIN_REPLICATION (1)" this means a block could not be distributed. Most likely there is no datanode running, or the datanode has some severe problem (like the lock problem mentioned above).
  
-  * The tutorial says you should point the searcher to the DFS namenode. This seems to be pretty inefficient; in a real distributed case you would need to set up distributed searchers and avoid network I/O for the DFS. It would be nice if this could be addressed in a future version of this tutorial.  
-