Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/04/25 13:29:55 UTC

[Nutch Wiki] Update of "NutchDistributedFileSystem" by PiotrKosiorowski

The following page has been changed by PiotrKosiorowski:
http://wiki.apache.org/nutch/NutchDistributedFileSystem

The comment on the change is:
Package structure and example updated

------------------------------------------------------------------------------
  
  How much is sufficiently? That should be user-controlled, but right now it's hard-coded. The NDFS tries to make sure each Block exists on two Datanodes at any one time, though it will still operate if that's impossible. The numbers are low because a lot of Nutch users use just a few machines, where higher replication rates are impossible.
  
- However, NDFS has been designed with large installations in mind. I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum. (Until we get a proper config interface in place, adjust the DESIRED_REPLICATION and MIN_REPLICATION constants in fs/FSNamesystem.java)
+ However, NDFS has been designed with large installations in mind. I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum. The desired replication can be set in the Nutch config file using the "ndfs.replication" property; the MIN_REPLICATION constant is located in ndfs/FSNamesystem.java (and is set to 1 by default).
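As a rough illustration of why a desired replication of 3 cannot be met on a very small cluster, here is a minimal sketch. It assumes the number of copies NDFS can reach is capped by the number of live DataNodes. The class and method names are hypothetical and the logic is not taken from FSNamesystem.java.

```java
// Hypothetical illustration only; not the logic used in ndfs/FSNamesystem.java.
final class ReplicationSketch {

    /**
     * Returns how many additional copies of a block would need to be scheduled,
     * assuming the reachable replication is capped by the number of live DataNodes.
     */
    static int replicasToSchedule(int currentReplicas, int desiredReplication, int liveDatanodes) {
        int achievable = Math.min(desiredReplication, liveDatanodes);
        return Math.max(0, achievable - currentReplicas);
    }

    public static void main(String[] args) {
        // One copy exists, three are desired, five DataNodes are alive.
        System.out.println(replicasToSchedule(1, 3, 5)); // prints 2
        // One copy exists, three are desired, but only one DataNode is alive.
        System.out.println(replicasToSchedule(1, 3, 1)); // prints 0
    }
}
```

With a single machine the system keeps operating with one copy, which matches the note above that low replication is tolerated when more copies are impossible.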
  
  = System Details =
  
- More details on NDFS operation are coming soon. For now, take a look at the following files, all in src/net/nutch/fs/*.java:
+ More details on NDFS operation are coming soon. For now, take a look at the following files, all in src/org/apache/nutch/ndfs/*.java:
  
  NDFS.java has two inner classes, each with a main(). One is for the NameNode, and one is for the DataNode. This file has all the network-handling code. Much of the work is handed to other classes.
  
@@ -100, +100 @@

  
  = System Integration =
  
- We are working to integrate the NDFS with Nutch filesystem stubs that we placed in the WebDB earlier (to support the distributed WebDB). It should be easy to add a switch to all Nutch tools to change quickly between the local filesystem and an NDFS installation.
+ The majority of Nutch tools use the Nutch``File``System abstraction to access files. Two implementations are currently available: Local``File``System and NDFSFile``System. Unless one is specified in the command-line arguments of a tool, the filesystem implementation to use is taken from the config file property named "fs.default.name". Possible values of this property are "local" for Local``File``System and "host:port" for NDFSFile``System. In the second case, the host and port values give the Name``Node location.
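The two accepted forms of "fs.default.name" can be illustrated with a small parsing sketch. This is not the Nutch implementation: the class and method names below are hypothetical, and the only behavior taken from the text above is that "local" selects the local filesystem and "host:port" names the NameNode.

```java
import java.net.InetSocketAddress;

// Hypothetical helper, not part of Nutch: interprets an "fs.default.name" value.
final class FsNameSketch {

    /**
     * Returns null when the value is the literal "local" (local filesystem),
     * otherwise the NameNode address parsed from a "host:port" value.
     */
    static InetSocketAddress parseNameNode(String fsName) {
        if ("local".equals(fsName)) {
            return null;
        }
        int colon = fsName.lastIndexOf(':');
        if (colon <= 0 || colon == fsName.length() - 1) {
            throw new IllegalArgumentException(
                "Expected \"local\" or \"host:port\", got: " + fsName);
        }
        String host = fsName.substring(0, colon);
        // NumberFormatException (an IllegalArgumentException) signals a bad port.
        int port = Integer.parseInt(fsName.substring(colon + 1));
        // Unresolved address: no DNS lookup is performed in this sketch.
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        InetSocketAddress nameNode = parseNameNode("A:9000");
        System.out.println(nameNode.getHostString() + " " + nameNode.getPort());
        System.out.println(parseNameNode("local") == null ? "local filesystem" : "NDFS");
    }
}
```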
+ 
+ = Configuration properties =
+ 
+ NDFS-related properties (descriptions taken from the config file):
+ 
+ "fs.default.name" - The name of the default file system.  Either the literal string "local" or a host:port for NDFS.
+ 
+ "ndfs.name.dir" - Determines where on the local filesystem the NDFS name node should store the name table.
+ 
+ "ndfs.data.dir" - Determines where on the local filesystem an NDFS data node should store its blocks.
+ 
+ "ndfs.replication" - How many copies of each block NDFS tries to keep at all times (not present in the config file).
  
  = Quick Demo =
  
- On machine A, run: $ java net.nutch.fs.NDFS$NameNode 9000 namedir
+ On machines A, B, and C, set the following in the Nutch config file:
  
- On machine B, run: $ java net.nutch.fs.NDFS$DataNode datadir1 machineB 8000 machineA:9000
+ fs.default.name = A:9000
  
- On machine C, run: $ java net.nutch.fs.NDFS$DataNode datadir2 machineC 8000 machineA:9000
+ ndfs.name.dir=/tmp/nutch/ndfs/name
  
- You now have an NDFS installation with one NameNode? and two DataNodes?. (Note, of course, you don't have to run these on different machines. It's enough to use different directories and avoid port conflicts.)
+ ndfs.data.dir=/tmp/nutch/ndfs/data
  
- Anywhere, run the client: $ java net.nutch.fs.TestClient machineA:9000 CREATE foo.txt $ java net.nutch.fs.TestClient machineA:9000 GET foo.txt $ java net.nutch.fs.TestClient machineA:9000 RENAME foo.txt bar.txt $ java net.nutch.fs.TestClient machineA:9000 GET bar.txt $ java net.nutch.fs.TestClient machineA:9000 DELETE bar.txt
  
- You have just created a large file, retrieved it, renamed it, retrieved it again, and deleted it.
+ 
+ On machine A, run: $ nutch namenode 
+ 
+ On machine B, run: $ nutch datanode
+ 
+ On machine C, run: $ nutch datanode 
+ 
+ You now have an NDFS installation with one NameNode and two DataNodes. (Of course, you don't have to run these on different machines; it's enough to use different directories and avoid port conflicts.) DataNodes listen on port 7000 or greater: each one probes for a free port, starting from 7000.
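The port probing described above can be sketched roughly as follows. This is an illustration of the general technique, not the DataNode code: the class name, method name, and the upper bound on the port range are assumptions made for the example.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch of "probe for a free port starting from 7000".
final class PortProbeSketch {

    /**
     * Tries each port from startPort to maxPort (inclusive) and returns a
     * ServerSocket bound to the first one that is free. The maxPort limit is
     * an assumption of this sketch.
     */
    static ServerSocket bindFirstFree(int startPort, int maxPort) throws IOException {
        for (int port = startPort; port <= maxPort; port++) {
            try {
                return new ServerSocket(port);
            } catch (IOException portUnavailable) {
                // Port could not be bound (typically already in use); try the next one.
            }
        }
        throw new IOException("No free port in range " + startPort + "-" + maxPort);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket socket = bindFirstFree(7000, 7100)) {
            System.out.println("Listening on port " + socket.getLocalPort());
        }
    }
}
```

This is why several DataNodes can share one machine without explicit port settings: the second process simply ends up on the next free port.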
+ 
+ On any machine that has fs.default.name = A:9000 in its Nutch config file, run the client:
+ 
+ $ nutch org.apache.nutch.fs.Test``Client 
+ 
+ Run without arguments, it displays the NDFS operations that this test tool can perform.
+ 
+ To test basic NDFS operation, execute:
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -mkdir /test
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -ls /
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -put local_file /test/testfile
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -ls /test
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -cp /test/testfile /test/backup
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -rm /test/testfile
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -mv /test/backup /test/testfile
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -get /test/testfile local_copy
+ 
+ 
+ You have just created a directory, listed the root directory, copied a file from the local filesystem into the new directory, listed that directory, copied the file within NDFS, removed the original, renamed the backup to the original name, and retrieved a copy from NDFS to the local filesystem.
+ 
+ There are also additional commands that allow you to inspect the state of NDFS:
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -report
+ 
+ $ nutch org.apache.nutch.fs.Test``Client -du /
  
  You might try interesting things like the following:
   1. Start a NameNode and one DataNode
@@ -126, +177 @@

  
  The system should have replicated the relevant blocks, making the data still available in step 6.
  
- If you want to read/write directly, use the API exposed in net.nutch.fs.NDFSClient
+ If you want to read/write directly, use the API exposed in org.apache.nutch.ndfs.NDFSClient
  
  = Conclusion =