Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/11/11 11:48:05 UTC

[Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by PaulBaclace:
http://wiki.apache.org/nutch/OverviewDeploymentConfigs

New page:
== Overview of Deployment Configurations in Nutch 0.8 ==
(11/2005 Paul Baclace)

This page describes a range of deployment configurations, the assumptions involved, and the relevant property settings.  The primary focus is on a few canonical deployment scenarios and the issues surrounding them.  The relevant properties are described, but a complete description of all properties is not attempted here.

The process startup sequence is also described, in order to show how the deployments differ.

Flexibility of assumptions is noted with MUST (rigid) or SHOULD (highly recommended, but could be different for the adventurous). 

=== Configuration File Overview ===

When building Nutch, the conf directory contains two important property files that are placed on the classpath for lookup at runtime:

 * ''nutch-default.xml'' the place for universal defaults as set by the Nutch developers.
 * ''nutch-site.xml'' the highest priority properties, which override all others.
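For example, a minimal ''nutch-site.xml'' override might look like the following sketch (the directory values are illustrative, and the root element shown here should be checked against the nutch-default.xml in your own checkout):

```xml
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>ndfs.data.dir</name>
    <value>/d1/ndfs,/d2/ndfs</value>
  </property>
</nutch-conf>
```

Any property set here takes precedence over the same property in nutch-default.xml.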

The Java System Properties are ''not'' consulted for Nutch properties, so -D style command-line overriding is strongly discouraged.  System Properties are still consulted for standard Java properties, however.

The bin/nutch sh script places $NUTCH_HOME/conf at the beginning of the classpath so that the xml property files can be found.
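In outline, that classpath setup amounts to something like this (a simplified, hypothetical sketch, not the real script; the NUTCH_HOME default and jar layout are assumptions):

```shell
# Simplified sketch of how bin/nutch builds the classpath (hypothetical).
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch}   # illustrative default location
CLASSPATH="$NUTCH_HOME/conf"           # conf first, so its xml files are found
for f in "$NUTCH_HOME"/lib/*.jar; do   # then the bundled jars
  CLASSPATH="$CLASSPATH:$f"
done
echo "$CLASSPATH"
```

Because conf comes first, the nutch-site.xml and nutch-default.xml there shadow any copies that might be packed inside a jar.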

=== Nutch Shell Scripts ===

A meta-assumption here is that the sh scripts in the nutch bin directory are used to start and control the ensemble of processes across many machines.  

The Nutch shell scripts are simple and elegant, and they form a call hierarchy, starting at the top level:
 1. start-all.sh or stop-all.sh - start and stop the whole ensemble.
 2. nutch-daemons.sh - run a Nutch command on all slave hosts.
 3. slaves.sh - run a shell command on all slave hosts.
 4. nutch-daemon.sh - run a Nutch command as a daemon with a start|stop argument, like a regular Unix/Linux /etc/rc.local script; the process pid is stored during start and used during stop.  Runs rsync at start.
 5. nutch - run a Nutch command using the JVM.

Depending upon the context of use, any level of these scripts can be handy on the command line.
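As an illustration of level 3, the fan-out in slaves.sh can be sketched like this (simplified and hypothetical; the ssh invocation is echoed rather than executed so the sketch is self-contained):

```shell
# Rough sketch of the slaves.sh fan-out pattern (not the real script).
run_on_slaves() {
  while read -r slave; do
    # the real script would run:  ssh "$slave" "$@" &
    echo ssh "$slave" "$@"
  done < "${NUTCH_SLAVES:-$HOME/.slaves}"
}

printf 'host1\nhost2\n' > /tmp/slaves.demo   # stand-in for ~/.slaves
NUTCH_SLAVES=/tmp/slaves.demo
run_on_slaves uptime
```

The higher-level scripts build on exactly this pattern, substituting a Nutch daemon command for the plain shell command.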

=== Configuration Assumptions ===

For simplicity of configuration, filenames you pass to commands SHOULD be pathnames that work on all hosts. When working with just a few hosts, this seems to be a limitation, but it obviously makes a lot of sense when hundreds or thousands of machines are involved.

 1. Property settings are meant to be the same across hosts; they SHOULD not be customized per host (they are not even settable on the commandline, so per-process settings are discouraged).
 2. Filenames and paths are meant to be the same across hosts (SHOULD).
 3. Some file paths do not specify NDFS versus the local filesystem, and are interpreted according to which kind of filesystem is in use.
 4. Each machine (including the master) SHOULD have Nutch installed in the same filesystem path.
 5. The ndfs.data.dir and mapred.local.dir properties list comma-separated directories.  Only those that exist are used, so not all machines are required to have exactly the same devices.
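The existing-directories rule in item 5 can be illustrated with a small sketch (the real filtering happens inside the daemon code; the directory names here are made up):

```shell
# Keep only the directories in a comma-separated list that actually exist.
mkdir -p /tmp/ndfs1 /tmp/ndfs2             # two existing data dirs for the demo
dirs="/tmp/ndfs1,/tmp/missing,/tmp/ndfs2"  # illustrative ndfs.data.dir value
used=""
IFS=','
for d in $dirs; do
  [ -d "$d" ] && used="$used $d"
done
unset IFS
echo "using:$used"
```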

=== System Assumptions ===

  1. The env var NUTCH_MASTER is set to the hostname of the master machine.
  2. The slave nodes are defined by putting a list of hostnames, one per line, in ~/.slaves (alternatively, use NUTCH_SLAVES to refer to a different file).
  3. A cluster of machines is managed from a master machine, without a firewall between any of the machines (MUST, for simplicity).  Many TCP/IP ports are used.
  4. The master machine MUST have a no-password login (ssh) to all the slave machines, using the same username.
  5. Set environment variables in ~/.ssh/environment, since ssh does not source your .bash_profile.  These include JAVA_HOME, NUTCH_LOG_DIR, NUTCH_SLAVES and NUTCH_MASTER.
  6. Make sure that your NUTCH_LOG_DIR and the directories named in ndfs.data.dir exist on all slaves.  This can be done most easily with bin/slaves.sh.
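Concretely, a ~/.ssh/environment might contain lines like these (all values are illustrative; the file takes plain NAME=value lines with no shell expansion, and depending on your sshd configuration, PermitUserEnvironment may need to be enabled for it to be read):

```
JAVA_HOME=/usr/local/java
NUTCH_MASTER=master.example.com:/home/nutch/src/nutch
NUTCH_LOG_DIR=/home/nutch/logs
NUTCH_SLAVES=/home/nutch/.slaves
```

Item 6 could then be handled with something like bin/slaves.sh mkdir -p /home/nutch/logs (paths again illustrative).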

=== Deployment Startup Sequences ===

 A. Cluster deployment with too many machines to customize (probably more than 4; 1000 machines should be possible):

  1. bin/slaves.sh rsync-command is used as needed to update jars and conf files from the master.
  2. The ensemble starts by running bin/start-all.sh on the master.
  3. start-all.sh uses bin/nutch-daemons.sh to run one datanode process on each slave (in the background, without waiting; one daemon thread is started per comma-separated storage device, and non-existent storage devices in the list are ignored).
  4. start-all.sh runs one namenode and one jobtracker on the master.
  5. start-all.sh uses bin/nutch-daemons.sh to run one tasktracker process on each slave (in the background, without waiting).
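The sequence above can be sketched as follows (a hypothetical simplification; echoes stand in for the real daemon launches):

```shell
# Hypothetical sketch of the start-all.sh launch order.
start_all() {
  echo "nutch-daemons.sh start datanode"      # on each slave, backgrounded
  echo "nutch-daemon.sh start namenode"       # on the master
  echo "nutch-daemon.sh start jobtracker"     # on the master
  echo "nutch-daemons.sh start tasktracker"   # on each slave, backgrounded
}
start_all
```

Note that the filesystem daemons (datanode, namenode) come up before the MapReduce daemons (jobtracker, tasktracker) that depend on them.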


 B. Cluster of a few machines:
  1. ''Add more details here''

 C. One developer debugging on one machine:
  1. ''Add more details here''

Re: [Nutch-cvs] [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Posted by Andrzej Bialecki <ab...@getopt.org>.
Apache Wiki wrote:

>  4. the master machine MUST have a no-password login (ssh) to all the slave machines, using the same username.
>  
>

Actually, you can publish the public key of the account on the master 
machine to all nodes (put it in ~/.ssh/authorized_keys) - this way the 
accounts are still protected, but the authentication is performed using 
private/public keys, and doesn't require any interaction either.
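This could be set up roughly as follows (assuming OpenSSH; the key filename is illustrative, and the distribution step is left as a comment since it needs a live slave):

```shell
# Create a passphrase-less key pair for the master account (name is illustrative).
mkdir -p "$HOME/.ssh"
KEY="$HOME/.ssh/id_rsa_nutch"
[ -f "$KEY" ] || ssh-keygen -q -t rsa -N "" -f "$KEY"
echo "public key to distribute: $KEY.pub"
# Then, for each slave, append the public key to its authorized_keys, e.g.:
#   ssh slave1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < "$KEY.pub"
```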

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Posted by Doug Cutting <cu...@nutch.org>.
Great stuff, Paul!

A few minor corrections.

Apache Wiki wrote:
>   1. The env var NUTCH_MASTER is set to the hostname of the master machine.

This is optional.  The alternative is to mount a common home directory 
with NFS, as many clusters do, and keep the Nutch software there.

Also, NUTCH_MASTER is an rsync path, so it should be set to something of 
the form host:/path/to/nutch, e.g., "foo.bar.com:/home/$USER/src/nutch".

>   2. The slave nodes are defined by putting list of hostnames, one per line, in ~/.slaves  (alternatively, use NUTCH_SLAVES to refer to a different file).

This location can be altered with the environment variable NUTCH_SLAVES.

Thanks for writing this.

Doug

Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Posted by Stefan Groschupf <sg...@media-style.com>.
Oops, sorry...
Paul, you should perhaps mention that these scripts require an ssh version higher than 3.8.
A great page!

Stefan



Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Posted by Stefan Groschupf <sg...@media-style.com>.