Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/06/16 18:00:50 UTC

[Nutch Wiki] Trivial Update of "DistributedWebDB" by LewisJohnMcgibbney

The "DistributedWebDB" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/DistributedWebDB?action=diff&rev1=2&rev2=3

   
  It's worthwhile to spend a little time on how the distributed WebDBWriter's inter-process communication works. No process ever opens a socket to communicate directly with another. Rather, all data is communicated via files. However, since the WebDBWriters exist over many machines and filesystems, these files need to be copied back and forth.
   
- We do that via the NutchFileSystem. Really, "filesystem" is overstating the case a little bit. Rather it's a mechanism for a shared file namespace, with some automatic file copying between machines that announce interest in that namespace.
+ We do that via the NutchDistributedFileSystem. Really, "filesystem" is overstating the case a little bit. Rather it's a mechanism for a shared file namespace, with some automatic file copying between machines that announce interest in that namespace.
   
- Every object under control of a NutchFileSystem machine group is represented with a "NutchFile" object. A NutchFile is named using three different parameters:
+ Every object under control of a NutchDistributedFileSystem machine group is represented with a "NutchFile" object. A NutchFile is named using three different parameters:
   
   * "dbName" indicates the overall database that the file belongs to. This exists to enable multiple Nutch instances simultaneously on the same machine set. All files within a given instance will have the same dbName.
   
-  * "shareGroupName" is used to control where the NutchFile will be copied. Clients of the NutchFileSystem ask for NutchFiles via sharegroup. Each machine in a NutchFileSystem machine group is configured so that it knows the entire (sharegroup<->machine) mapping.  For the purposes of the distributed WebDB, we create a sharegroup for each partition of the webdb. When a single WebDBWriter is emitting k separate edits files, it is writing to files in k different share groups. Having written everything out, the Writer demands to see any files meant for its particular segment's sharegroup. (We will also create a "master" sharegroup to contain the final db output.)
+  * "shareGroupName" is used to control where the NutchFile will be copied. Clients of the NutchDistributedFileSystem ask for NutchFiles via sharegroup. Each machine in a NutchDistributedFileSystem machine group is configured so that it knows the entire (sharegroup<->machine) mapping. For the purposes of the distributed WebDB, we create a sharegroup for each partition of the webdb. When a single WebDBWriter is emitting k separate edits files, it is writing to files in k different sharegroups. Having written everything out, the Writer demands to see any files meant for its particular segment's sharegroup. (We will also create a "master" sharegroup to contain the final db output.)
   
   * "name" is just an arbitrary slash-separated filename. It describes a directory/filename hierarchy for the NutchFile in question.
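
  As a rough illustration of the three-part naming above, here is a minimal Java sketch. It is not the actual Nutch class: the class name NutchFileSketch, the "partition-N" sharegroup naming, and the "edits/writer-N" file name are all assumptions made for this example. It shows one writer addressing k edits files, one per partition sharegroup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the naming scheme; not the real Nutch API.
final class NutchFileSketch {
    final String dbName;         // overall database the file belongs to
    final String shareGroupName; // controls which machines receive a copy
    final String name;           // arbitrary slash-separated filename

    NutchFileSketch(String dbName, String shareGroupName, String name) {
        this.dbName = Objects.requireNonNull(dbName);
        this.shareGroupName = Objects.requireNonNull(shareGroupName);
        this.name = Objects.requireNonNull(name);
    }

    // Assumed convention: one sharegroup per webdb partition.
    static String shareGroupForPartition(int partition) {
        return "partition-" + partition;
    }

    // A writer emitting k edits files targets k different sharegroups,
    // all within the same dbName.
    static List<NutchFileSketch> editsFiles(String dbName, int writerId, int k) {
        List<NutchFileSketch> files = new ArrayList<>();
        for (int p = 0; p < k; p++) {
            files.add(new NutchFileSketch(
                    dbName, shareGroupForPartition(p), "edits/writer-" + writerId));
        }
        return files;
    }
}
```

  Calling NutchFileSketch.editsFiles("webdb", 2, 3) would yield three objects that share a dbName and a name but differ in sharegroup, which is what lets the copying mechanism route each file to the machines handling that partition.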
   
- A NutchFile object can be resolved to a "real-world" disk File with the help of a local NutchFileSystem object. Each machine in a NutchFileSystem machine group has a NutchFileSystem object that handles configuration and other services. One such config value is the place on a local disk where the "root" of the NutchFileSystem is found. The disk File embodiment of a NutchFile is a combination of that root, the shareGroupName, and the name.
+ A NutchFile object can be resolved to a "real-world" disk File with the help of a local NutchDistributedFileSystem object. Each machine in a NutchDistributedFileSystem machine group has a NutchDistributedFileSystem object that handles configuration and other services. One such config value is the place on a local disk where the "root" of the NutchDistributedFileSystem is found. The disk File embodiment of a NutchFile is a combination of that root, the shareGroupName, and the name.
   
  (Of course, not all sharegroups' files will be present on each machine. That depends on the (sharegroup<->machine) mapping.)
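
  The resolution step described above can be sketched as follows. This is only an illustration under stated assumptions: the class LocalResolverSketch, the machine identifiers, and the on-disk layout root/shareGroupName/name are hypothetical (the text says the disk File combines those three parts but does not give the exact layout). The sketch also assumes "name" is a relative, slash-separated path.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; names and layout are assumptions, not the Nutch code.
final class LocalResolverSketch {
    private final Path root;           // configured local root directory
    private final String localMachine; // identifier of this machine
    // The full (sharegroup <-> machine) mapping known to every machine.
    private final Map<String, Set<String>> machinesBySharegroup;

    LocalResolverSketch(Path root, String localMachine,
                        Map<String, Set<String>> machinesBySharegroup) {
        this.root = root;
        this.localMachine = localMachine;
        this.machinesBySharegroup = machinesBySharegroup;
    }

    // True if this machine is listed for the given sharegroup.
    boolean isLocal(String shareGroupName) {
        Set<String> machines = machinesBySharegroup.get(shareGroupName);
        return machines != null && machines.contains(localMachine);
    }

    // Combine root, shareGroupName, and name into a local path.
    // Returns empty when the sharegroup's files are not kept on this machine.
    Optional<Path> resolve(String shareGroupName, String name) {
        if (!isLocal(shareGroupName)) {
            return Optional.empty();
        }
        return Optional.of(root.resolve(shareGroupName).resolve(name));
    }
}
```

  With a root of /data/nutch and a mapping that assigns "partition-0" to this machine, resolve("partition-0", "edits/w1") would point at /data/nutch/partition-0/edits/w1, while a sharegroup not mapped to this machine resolves to nothing.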
   
- The NutchFileSystem also takes care of file moves, deleted, locking, and guaranteed atomicity.
+ The NutchDistributedFileSystem also takes care of file moves, deletes, locking, and guaranteed atomicity.
   
- It should be clear that the NutchFileSystem can be implemented across any group of machines that have mutual remote-access rights. It can also be used across machines that share mutual Network File System mounts.
+ It should be clear that the NutchDistributedFileSystem can be implemented across any group of machines that have mutual remote-access rights. It can also be used across machines that share mutual Network File System mounts.