Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/07/07 22:32:49 UTC
[jira] Resolved: (NUTCH-843) Separate the build and runtime environments

     [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-843.
-------------------------------------

    Fix Version/s: 2.0
       Resolution: Fixed

 Committed in r961498. Thanks Chris for the review!

> Separate the build and runtime environments
> -------------------------------------------
>
>                 Key: NUTCH-843
>                 URL: https://issues.apache.org/jira/browse/NUTCH-843
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 2.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 2.0
>
>         Attachments: NUTCH-843.patch, NUTCH-843.patch
>
>
> Currently there is no clean separation of source, build and runtime artifacts. On the one hand, this makes it easier to get started in local mode; on the other hand, it makes a distributed (or pseudo-distributed) setup much more challenging and tricky. Also, some resources (config files and classes) are included several times on the classpath and loaded under different classloaders, so in the end it's not obvious which copy takes precedence, or why.
> Here's an example of a harmful unintended behavior caused by this mess: Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on their classpath. This means that a task running on this cluster will have two copies of resources from these locations - one from the inherited classpath from tasktracker, and the other one from the just unpacked nutch.job file. If these two versions differ, only the first one will be loaded, which in this case is the one taken from the (unpacked) conf/ and build/ - the other one, from within the nutch.job file, will be ignored.
> It's even worse when you add more nodes to the cluster: nutch.job will be shipped to the new nodes as part of each task setup, but now the remote tasktracker child processes will use resources from nutch.job, so some tasks will use different versions of resources than other tasks. This usually leads to a host of issues that are very difficult to debug.
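
The "first copy wins" behavior described above can be reproduced outside Hadoop with a plain URLClassLoader. The following is only a minimal sketch, not Nutch or Hadoop code: the class name ClasspathShadowDemo, the two temporary directories (stand-ins for the inherited conf/ and the unpacked nutch.job) and the file contents are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class ClasspathShadowDemo {

    /** Creates a temp directory holding one resource file with the given content. */
    static Path dirWithResource(String prefix, String name, String content) throws IOException {
        Path dir = Files.createTempDirectory(prefix);
        Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    /** Lists every classpath location that provides the named resource, in lookup order. */
    static List<URL> copiesOf(ClassLoader loader, String name) throws IOException {
        return Collections.list(loader.getResources(name));
    }

    /** Reads the copy the loader actually resolves, i.e. the first one found. */
    static String effectiveContent(ClassLoader loader, String name) throws IOException {
        try (InputStream in = loader.getResourceAsStream(name);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            if (in == null) {
                throw new FileNotFoundException(name);
            }
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Returns { content of the winning copy, number of copies on the classpath }. */
    static String[] run() throws IOException {
        String name = "nutch-default.xml";
        // Hypothetical stand-ins for the two locations named in the issue.
        Path inherited = dirWithResource("conf", name, "copy from inherited conf/");
        Path unpacked = dirWithResource("job", name, "copy from unpacked nutch.job");
        // Order matters: the inherited directory comes first, as on the tasktracker.
        URL[] classpath = { inherited.toUri().toURL(), unpacked.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(classpath, null)) {
            return new String[] {
                effectiveContent(loader, name),
                String.valueOf(copiesOf(loader, name).size())
            };
        }
    }

    public static void main(String[] args) throws IOException {
        String[] result = run();
        System.out.println("effective copy: " + result[0]);
        System.out.println("copies on classpath: " + result[1]);
    }
}
```

Running main should report the content of the first classpath entry as the effective copy even though two copies are visible. Calling getResources for a suspect file inside a task is one way to see whether a cluster is affected.
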
> This issue therefore proposes separating these environments into the following areas:
> * source area - i.e. our current sources. Note that bin/ scripts will belong to this category too, so there will be no top-level bin/. nutch-default.xml belongs to this category too. Other customizable files can be moved to src/conf too, or they could stay in top-level conf/ as today, with a README that explains that changes made there take effect only after you rebuild the job jar.
> * build area - contains build artifacts, among them the nutch.job jar.
> * runtime (or deploy) area - this area contains all artifacts needed to run Nutch jobs. For a distributed setup that uses an existing Hadoop cluster (installed from a plain vanilla Hadoop release) this will be a {{/deploy}} directory, where we put the following:
> {code}
> bin/nutch
> nutch.job
> {code}
> That's it - nothing else should be needed, because all other resources are already included in the job jar. These resources can be copied directly to the master Hadoop node.
> For a local setup (using LocalJobTracker) this will be a {{/runtime}} directory, where we put the following:
> {code}
> bin/nutch
> lib/hadoop-libs
> plugins/
> nutch.job
> {code}
> Due to limitations in the PluginClassLoader, the local runtime requires that the plugins/ directory be unpacked from the job jar. We also need the Hadoop libs to run in local mode. We may later refine this local setup to something like this:
> {code}
> bin/nutch
> conf/
> lib/hadoop-libs
> lib/nutch-libs
> plugins/
> nutch.jar
> {code}
> so that it's easier to modify the config without rebuilding the job jar (which actually would not be used in this case).
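
The two runtime layouts listed above lend themselves to a quick sanity check. The sketch below is hypothetical and not part of the committed patch: the class RuntimeLayoutCheck and its Area enum are invented here, and it assumes that "lib/hadoop-libs" in the listing means "Hadoop jars under lib/", so it only checks that the lib directory exists.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RuntimeLayoutCheck {

    /** The two runtime areas proposed in the issue, with the entries each one lists. */
    enum Area {
        DEPLOY("bin/nutch", "nutch.job"),
        // Assumption: "lib/hadoop-libs" is shorthand for Hadoop jars under lib/,
        // so only the lib directory itself is required here.
        LOCAL("bin/nutch", "lib", "plugins", "nutch.job");

        final List<String> required;

        Area(String... required) {
            this.required = Arrays.asList(required);
        }
    }

    /** Returns the required entries that do not exist under root. */
    static List<String> missing(Path root, Area area) {
        List<String> absent = new ArrayList<>();
        for (String entry : area.required) {
            if (!Files.exists(root.resolve(entry))) {
                absent.add(entry);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        // Check the directory given on the command line, or the current one.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Area area : Area.values()) {
            System.out.println(area + " missing: " + missing(root, area));
        }
    }
}
```

Pointing it at a directory prints, for each area, which of the listed entries are absent. An empty list for DEPLOY would mean the directory holds everything the issue says needs to be copied to the master Hadoop node.
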

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.