Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/11/11 13:45:20 UTC
Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace
On 11.11.2005 at 11:48, Apache Wiki wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Nutch Wiki"
> for change notification.
>
> The following page has been changed by PaulBaclace:
> http://wiki.apache.org/nutch/OverviewDeploymentConfigs
>
> New page:
> == Overview of Deployment Configurations in Nutch 0.8 ==
> (11/2005 Paul Baclace)
>
> This page describes a range of deployment configurations, the
> assumptions involved, and the relevant property settings. The
> primary focus is on a few canonical deployment scenarios and
> surrounding issues. Relevant properties are described, but a
> complete description of all properties is not attempted here.
>
> The process startup sequence is also described to show how the
> deployments differ.
>
> Flexibility of assumptions is noted with MUST (rigid) or SHOULD
> (highly recommended, but could be different for the adventurous).
>
> === Configuration File Overview ===
>
> When building Nutch, the conf directory has 2 important property
> files that are put into the classpath for lookup at runtime:
>
> * ''nutch-default.xml'' holds the universal defaults set by the
> Nutch developers.
> * ''nutch-site.xml'' holds the highest-priority properties, which
> override all others.
>
> The Java System Properties are ''not'' consulted for Nutch
> properties, so -D style command-line overriding is strongly
> discouraged. However, System Properties are still used when
> standard (non-Nutch) properties need to be looked up.
>
> The bin/nutch sh script places $NUTCH_HOME/conf at the beginning of
> the classpath so that the xml property files can be found.
>
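The classpath ordering described above can be sketched roughly as follows. This is only an illustration: build_classpath is a hypothetical helper, and the lib/*.jar layout is an assumption; the real bin/nutch may assemble its classpath differently.

```shell
#!/bin/sh
# Sketch only: build a classpath with the conf directory first, so that
# nutch-default.xml and nutch-site.xml are found before anything in the jars.
# build_classpath is a hypothetical helper, not part of bin/nutch, and the
# lib/*.jar layout is an assumption.
build_classpath() {
  home=$1
  cp="$home/conf"
  for jar in "$home"/lib/*.jar; do
    # Skip the unexpanded pattern when the lib directory has no jars.
    if [ -e "$jar" ]; then
      cp="$cp:$jar"
    fi
  done
  printf '%s\n' "$cp"
}
```

Calling build_classpath "$NUTCH_HOME" would print a string that starts with $NUTCH_HOME/conf, which is what lets the xml property files win the lookup.
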
> === Nutch Shell Scripts ===
>
> A meta-assumption here is that the sh scripts in the nutch bin
> directory are used to start and control the ensemble of processes
> across many machines.
>
> The Nutch shell scripts are simple and elegant, and they form a call
> hierarchy, starting at the top level:
> 1. start-all.sh or stop-all.sh - start or stop the whole ensemble.
> 2. nutch-daemons.sh - run a Nutch command on all slave hosts.
> 3. slaves.sh - run a shell command on all slave hosts.
> 4. nutch-daemon.sh - run a Nutch command as a daemon with a
> start|stop argument like a regular Unix/Linux /etc/rc.local script;
> the process pid is stored during start and used during stop. Runs
> rsync at start.
> 5. nutch - run a Nutch command using the JVM.
>
> Depending upon the context of use, any level of these scripts can
> be handy on the command line.
>
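To make the "run a shell command on all slave hosts" level concrete, here is a minimal sketch of such a loop. It is not the actual slaves.sh; run_on_slaves and REMOTE_SHELL are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: run one command on every host listed in a slaves file.
# run_on_slaves and REMOTE_SHELL are hypothetical names for illustration;
# this is not the real bin/slaves.sh.
REMOTE_SHELL=${REMOTE_SHELL:-ssh}

run_on_slaves() {
  slaves_file=$1
  shift
  while read -r host; do
    # Skip blank lines in the host list.
    [ -n "$host" ] || continue
    # Start each remote command in the background; stdin is redirected so
    # the remote shell cannot consume the rest of the host list.
    "$REMOTE_SHELL" "$host" "$@" < /dev/null &
  done < "$slaves_file"
  # Wait for all background commands to finish.
  wait
}
```

For example, run_on_slaves "$HOME/.slaves" uptime would start uptime on each listed host in parallel and return once all of them have finished.
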
> === Configuration Assumptions ===
>
> For simplicity of configuration, filenames you pass to commands
> SHOULD be pathnames that work on all hosts. With just a few hosts
> this can seem like a limitation, but it makes a lot of sense when
> hundreds or thousands of machines are involved.
>
> 1. Property settings are meant to be the same across hosts; they
> SHOULD NOT be customized per host (they are not even settable on
> the command line, so per-process settings are discouraged).
> 2. Filenames and paths are meant to be the same across hosts
> (SHOULD).
> 3. Some file paths do not say whether they refer to NDFS or the
> local filesystem; they are interpreted according to the kind of
> filesystem in use.
> 4. Each machine, including the master, SHOULD have Nutch installed
> in the same filesystem path.
> 5. The ndfs.data.dir and mapred.local.dir properties list
> comma-separated directories. Only those that exist are used, so
> not all machines are required to have exactly the same devices.
>
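The "only those that exist are used" rule for comma-separated directory lists can be illustrated with a small shell sketch. This only mirrors the rule as stated; it is not how Nutch itself implements it, and existing_dirs is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: keep just the directories from a comma-separated list that
# exist on this host. existing_dirs is a hypothetical helper that mirrors
# the rule described for ndfs.data.dir and mapred.local.dir.
existing_dirs() {
  result=""
  old_ifs=$IFS
  IFS=,
  # Unquoted on purpose: the list is split on commas only.
  for dir in $1; do
    if [ -d "$dir" ]; then
      result="${result:+$result,}$dir"
    fi
  done
  IFS=$old_ifs
  printf '%s\n' "$result"
}
```

Running this on each host with the same list shows which devices that host would actually contribute, which is a quick way to check a shared configuration before starting the daemons.
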
> === System Assumptions ===
>
> 1. The env var NUTCH_MASTER is set to the hostname of the master
> machine.
> 2. The slave nodes are defined by putting a list of hostnames, one
> per line, in ~/.slaves (alternatively, use NUTCH_SLAVES to refer
> to a different file).
> 3. A cluster of machines is managed from a master machine,
> without a firewall between any of the machines (MUST, for
> simplicity). Many TCP/IP ports are used.
> 4. The master machine MUST have a no-password login (ssh) to all
> the slave machines, using the same username.
> 5. Set environment variables in ~/.ssh/environment, since ssh
> does not source your .bash_profile. These include JAVA_HOME,
> NUTCH_LOG_DIR, NUTCH_SLAVES and NUTCH_MASTER.
> 6. Make sure that your NUTCH_LOG_DIR and the directories named in
> ndfs.data.dir exist on all slaves. This can be done most easily
> with bin/slaves.sh.
>
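A minimal sketch of preparing the ~/.ssh/environment file mentioned above. The helper name write_ssh_environment and every value passed to it are hypothetical. Note also that OpenSSH reads this file only when the server's PermitUserEnvironment option is enabled, so that is an extra assumption here.

```shell
#!/bin/sh
# Sketch only: write NAME=value lines in the format of ~/.ssh/environment.
# write_ssh_environment is a hypothetical helper; the four variable names
# come from the wiki page, the values are supplied by the caller.
write_ssh_environment() {
  target=$1
  {
    printf 'JAVA_HOME=%s\n' "$2"
    printf 'NUTCH_LOG_DIR=%s\n' "$3"
    printf 'NUTCH_SLAVES=%s\n' "$4"
    printf 'NUTCH_MASTER=%s\n' "$5"
  } > "$target"
}
```

With placeholder values, write_ssh_environment "$HOME/.ssh/environment" /opt/java /var/log/nutch "$HOME/.slaves" master1 writes the four lines. Since slaves.sh runs a shell command on all slave hosts, something like bin/slaves.sh mkdir -p /var/log/nutch could then create the log directory everywhere.
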
> === Deployment Startup Sequences ===
>
> A. Cluster deployment with too many machines to customize
> (probably more than 4; 1000 machines should be possible):
>
> 1. bin/slaves.sh rsync-command is used as needed to update jars
> and conf files from the master.
> 2. The ensemble starts by running bin/start-all.sh on the master.
> 3. start-all.sh uses bin/nutch-daemons.sh to run one datanode
> process on each slave (in the background without waiting; one
> daemon thread is started per comma-separated storage device, and
> non-existent storage devices in the list are ignored).
> 4. start-all.sh runs one namenode and one jobtracker on the master.
> 5. start-all.sh uses bin/nutch-daemons.sh to run one tasktracker
> process on each slave (in the background without waiting).
>
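The startup order for deployment A can be written down as a dry run that prints each step instead of executing it. This is a sketch, not the real start-all.sh: start_all_sketch and RUN are hypothetical names, and the "start <daemon>" argument form is an assumption based on the start|stop argument described for the daemon script.

```shell
#!/bin/sh
# Sketch only: the startup order described above, as a dry run.
# start_all_sketch and RUN are hypothetical; the "start <daemon>" argument
# form is an assumption, not taken from the real start-all.sh.
RUN=${RUN:-echo}

start_all_sketch() {
  bin=$1
  # Datanodes on every slave.
  "$RUN" "$bin/nutch-daemons.sh" start datanode
  # Namenode and jobtracker on the master only.
  "$RUN" "$bin/nutch-daemon.sh" start namenode
  "$RUN" "$bin/nutch-daemon.sh" start jobtracker
  # Tasktrackers on every slave.
  "$RUN" "$bin/nutch-daemons.sh" start tasktracker
}
```

With the default RUN=echo, start_all_sketch bin only lists the four commands in order, which makes the slave-wide and master-only steps easy to tell apart.
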
>
> B. Cluster of a few machines:
> 1. ''Add more details here''
>
> C. One developer debugging on one machine:
> 1. ''Add more details here''
>
Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace
Posted by Stefan Groschupf <sg...@media-style.com>.
Oops, sorry...
Paul, you should perhaps mention that these scripts require ssh in a
version higher than 3.8.
A great page!
Stefan