Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/08/21 18:49:50 UTC

[Nutch Wiki] Update of "NutchHadoopTutorial" by ClemensMarschner

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by ClemensMarschner:
http://wiki.apache.org/nutch/NutchHadoopTutorial

------------------------------------------------------------------------------
  
  http://www.netlikon.de/docs/javadoc-hadoop-0.1/overview-summary.html
  
+ ----
  
+  * - I, StephenHalsey, have used this tutorial and found it very useful, but when I tried to add additional datanodes I got error messages in the logs of those datanodes saying "2006-07-07 18:58:18,345 INFO org.apache.hadoop.dfs.DataNode: Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.UnregisteredDatanodeException: Data node linux89-1:50010is attempting to report storage ID DS-1437847760. Expecting DS-1437847760.". I think this was because the hadoop/filesystem/data/storage file on the new datanodes was the same as on the original node it had been copied from, so they all reported the same storage ID. To get round this I stopped everything with bin/stop-all.sh on the namenode, deleted everything in the /filesystem directory on the new datanodes so they were clean, and ran bin/start-all.sh on the namenode again; the filesystem on the new datanodes was then recreated with new hadoop/filesystem/data/storage files and directories, and everything worked fine from then on. This is probably not a problem if you follow the above process without starting any datanodes, because they will all be empty, but it was for me because I had put some data onto the DFS of the single-datanode system before copying it all onto the new datanodes. I am not sure whether I made some other error in following this process, but I have added this note in case readers of this document hit the same problem. Well done on the tutorial by the way, very helpful. Steve.
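+ 
+  A minimal sketch of the recovery Steve describes, assuming /filesystem is the DFS directory used in the tutorial and that only the newly added datanodes are wiped (adjust the path to wherever your dfs.data.dir points):
+ 
+ {{{
+ # on the namenode: stop the whole cluster
+ bin/stop-all.sh
+ 
+ # on each NEW datanode only: remove the DFS data copied from the original node,
+ # so the datanode registers with a fresh storage ID instead of the cloned one
+ rm -rf /filesystem/*
+ 
+ # on the namenode: start everything again; the new datanodes recreate
+ # their filesystem/data/storage files with new IDs
+ bin/start-all.sh
+ }}}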
  
+ ----
+ 
+  * Nice tutorial! I tried to set it up without having fresh boxes available, just for testing (Nutch 0.8). I ran into a few problems but finally got it to work. Some gotchas:
+   * Use absolute paths for the DFS locations (see the sketch after this list). It may sound strange that I tried anything else, but I wanted to set up a single Hadoop node on my Windows laptop first and then extend to a Linux box, so relative path names would have come in handy, being the same on both machines. Don't try that; it won't work. The DFS showed a ".." directory which disappeared when I switched to absolute paths.
+   * I had problems getting DFS to run on Windows at all. I always ended up with this exception: "Could not complete write to file e:/dev/nutch-0.8/filesystem/mapreduce/system/submit_2twsuj/.job.jar.crc by DFSClient_-1318439814". It seems Nutch hasn't been tested much on Windows, so use Linux.
+   * Don't use DFS on an NFS mount (this would be pretty stupid anyway, but just for testing one might set it up in an NFS home directory). DFS uses locks, and NFS may be configured not to allow them.
+   * When you first start up Hadoop, there is a warning in the namenode log, "dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist". You can ignore it.
+   * If you get errors like "failed to create file [...] on client [foo] because target-length is 0, below MIN_REPLICATION (1)", it means a block could not be distributed. Most likely no datanode is running at all, or the datanode has some severe problem (like the lock problem mentioned above).
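+ 
+  A short sketch of the checks behind the absolute-path and MIN_REPLICATION gotchas above, assuming the usual conf/ and logs/ layout of a Hadoop 0.x install (file and property names may differ on your version):
+ 
+ {{{
+ # the DFS locations in conf/hadoop-site.xml should be absolute paths
+ grep -E -A 1 "dfs\.(name|data)\.dir" conf/hadoop-site.xml
+ 
+ # if blocks cannot be distributed, first check that a DataNode process is running at all ...
+ ps ax | grep org.apache.hadoop.dfs.DataNode
+ 
+ # ... then look for lock or storage-ID exceptions in its log
+ grep -i exception logs/*datanode*.log | tail
+ }}}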
+ 
+  * The tutorial says you should point the searcher at the DFS namenode. That seems pretty inefficient; in a real distributed setup you would want dedicated distributed searchers serving local indexes, so that every query does not incur DFS network I/O. It would be nice if a future version of this tutorial addressed that.
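+ 
+  A rough sketch of how a distributed-search setup could look with Nutch 0.8; the port, paths and host names below are examples only, so check searcher.dir in nutch-default.xml and the DistributedSearch class for the exact conventions:
+ 
+ {{{
+ # on each search node: pull the index and segments out of DFS onto local disk ...
+ bin/hadoop dfs -get /user/nutch/crawl /data/local-crawl
+ # ... and serve them with the distributed search server
+ bin/nutch server 9999 /data/local-crawl
+ 
+ # on the web front-end: point searcher.dir (in nutch-site.xml) at a local directory
+ # containing a search-servers.txt that lists the search nodes, one "host port" per line:
+ #   searchbox1 9999
+ #   searchbox2 9999
+ }}}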
+