Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/07/07 17:38:49 UTC

[jira] Created: (NUTCH-843) Separate the build and runtime environments

Separate the build and runtime environments
-------------------------------------------

                 Key: NUTCH-843
                 URL: https://issues.apache.org/jira/browse/NUTCH-843
             Project: Nutch
          Issue Type: Improvement
          Components: build
    Affects Versions: 2.0
            Reporter: Andrzej Bialecki 
            Assignee: Andrzej Bialecki 


Currently there is no clean separation of source, build, and runtime artifacts. On one hand, this makes it easier to get started in local mode; on the other hand, it makes the distributed (or pseudo-distributed) setup much more challenging and tricky. Also, some resources (config files and classes) are included several times on the classpath and loaded under different classloaders, and in the end it's not obvious which copy takes precedence, or why.

Here's an example of a harmful unintended behavior caused by this mess: the Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on their classpath. This means that a task running on this cluster will have two copies of the resources from these locations - one from the classpath inherited from the tasktracker, and the other from the just-unpacked nutch.job file. If the two versions differ, only the first one will be loaded - in this case the one taken from the (unpacked) conf/ and build/ - and the other one, from within the nutch.job file, will be ignored.

It's even worse when you add more nodes to the cluster: the nutch.job will be shipped to the new nodes as part of each task setup, but now the remote tasktracker child processes will use the resources from nutch.job - so some tasks will use different versions of the resources than others. This usually leads to a host of issues that are very difficult to debug.
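
As an aside - purely illustrative, not part of any patch, and the class name below is made up - a few lines of standard Java show the precedence rule at work: ClassLoader.getResources() lists every copy of a resource on the classpath, while getResource() returns only the first match, which is the copy that actually gets loaded.
{code}
// Illustrative diagnostic only (not part of the patch): list every copy of a
// resource visible on the classpath. The first URL is the one that wins when
// the resource is actually loaded, e.g. by Hadoop's Configuration.
import java.net.URL;
import java.util.Enumeration;

public class FindDuplicateResources {
  public static void main(String[] args) throws Exception {
    String name = args.length > 0 ? args[0] : "nutch-default.xml";
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    Enumeration<URL> copies = cl.getResources(name);
    while (copies.hasMoreElements()) {
      // e.g. file:/.../conf/nutch-default.xml vs. jar:file:/.../nutch.job!/nutch-default.xml
      System.out.println(copies.nextElement());
    }
    // getResource() returns only the first match - the copy that takes precedence.
    System.out.println("winner: " + cl.getResource(name));
  }
}
{code}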

This issue therefore proposes to separate these environments into the following areas:

* source area - i.e. our current sources. Note that the bin/ scripts belong to this category, so there will be no top-level bin/; nutch-default.xml belongs here as well. Other customizable files can be moved to src/conf too, or they could stay in the top-level conf/ as today, with a README that explains that changes made there take effect only after the job jar is rebuilt.

* build area - contains build artifacts, among them the nutch.job jar.

* runtime (or deploy) area - this area contains all artifacts needed to run Nutch jobs. For a distributed setup that uses an existing Hadoop cluster (installed from a plain vanilla Hadoop release) this will be a {{/deploy}} directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are already included in the job jar. These two files can be copied directly to the master Hadoop node.
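
As a quick sanity check (again purely illustrative - the class name is an assumption, not something shipped with Nutch), listing the entries of the job jar confirms that the config files, libraries and plugins are all packed into that single artifact:
{code}
// Illustrative only: check that the job jar is self-contained by listing its
// config files, bundled libraries and plugins. Assumes a nutch.job file in the
// current directory (or a path given as the first argument).
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class InspectJobJar {
  public static void main(String[] args) throws Exception {
    String path = args.length > 0 ? args[0] : "nutch.job";
    JarFile jar = new JarFile(path);
    Enumeration<JarEntry> entries = jar.entries();
    while (entries.hasMoreElements()) {
      String name = entries.nextElement().getName();
      // Config, libs and plugins all travel inside this single artifact.
      if (name.endsWith(".xml") || name.startsWith("lib/") || name.startsWith("plugins/")) {
        System.out.println(name);
      }
    }
    jar.close();
  }
}
{code}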

For a local setup (using Hadoop's LocalJobRunner) this will be a {{/runtime}} directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader, the local runtime requires that the plugins/ directory be unpacked from the job jar, and we need the Hadoop libs to run in local mode. We may later refine this local setup to something like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the configuration without rebuilding the job jar (which would not actually be used in this case).
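
For background on the PluginClassLoader constraint mentioned above: a per-plugin classloader is built from filesystem URLs, which cannot reach inside a nested job jar, hence the requirement for an unpacked plugins/ directory. The sketch below only illustrates the idea and is not Nutch's actual PluginClassLoader:
{code}
// Rough illustration only - NOT Nutch's actual PluginClassLoader. A classloader
// per plugin is built from filesystem URLs, which is why plugins/ must exist
// unpacked on disk rather than nested inside the job jar.
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class PluginLoaderSketch {
  public static ClassLoader forPlugin(File pluginDir) throws Exception {
    List<URL> urls = new ArrayList<URL>();
    File[] files = pluginDir.listFiles();
    if (files != null) {
      for (File f : files) {
        if (f.getName().endsWith(".jar")) {
          urls.add(f.toURI().toURL());  // e.g. plugins/parse-html/parse-html.jar
        }
      }
    }
    return new URLClassLoader(urls.toArray(new URL[0]),
        PluginLoaderSketch.class.getClassLoader());
  }

  public static void main(String[] args) throws Exception {
    File dir = new File(args.length > 0 ? args[0] : "plugins/parse-html");
    System.out.println("plugin classloader: " + forPlugin(dir));
  }
}
{code}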



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886015#action_12886015 ] 

Andrzej Bialecki  commented on NUTCH-843:
-----------------------------------------

We need to create the job file anyway. Actually, the patch I attached does something like this for the local setup (lib/ is flattened), but I would still argue for setting up two areas, /runtime/deploy and /runtime/local - it then becomes painfully obvious which parts you need to deploy to a Hadoop cluster.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886315#action_12886315 ] 

Julien Nioche commented on NUTCH-843:
-------------------------------------

I really like this. 

What shall we do with the Hadoop scripts in /bin and the native libs in /lib? Should they go to runtime/local as well?




[jira] Resolved: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-843.
-------------------------------------

    Fix Version/s: 2.0
       Resolution: Fixed

Committed in r961498. Thanks, Chris, for the review!



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885978#action_12885978 ] 

Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------

OK, so I read this more closely. I think it would be great if we didn't have to maintain two different deployment structures for local vs. remote setups. Some comments on your local proposal:

{code}
bin/nutch           - the main nutch script
conf/                 - all relevant Nutch conf files
lib/hadoop-libs  - static Hadoop lib files - are these jar files?
lib/nutch-libs     - what are these? jar files?
plugins/             - are these the plugin directories, or plugin jar files? 
nutch.jar           -  why wouldn't this go into the lib directory?
{code}

I could envision having one simple deployment structure that looked like this:

./bin/          - nutch script goes into here
./etc/          - all Nutch configuration property files, like nutch-default.xml, nutch-site.xml
./lib/           - all shared Nutch jar files (including the nutch.jar and hadoop.jar, as well as deps). Also it would be great to be able to generate a per-plugin jars that we could include in this lib directory as well. 
./logs/        - where all log files are written to
./run/         - where PID files (if generated) are written to

Thoughts?
 



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Pham Tuan Minh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887333#action_12887333 ] 

Pham Tuan Minh commented on NUTCH-843:
--------------------------------------

Thanks Julien, just call me Minh! 

Revision 963217 resolved my comment on issue NUTCH-846:

https://issues.apache.org/jira/browse/NUTCH-846

Thanks,





[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886323#action_12886323 ] 

Julien Nioche commented on NUTCH-843:
-------------------------------------

OK - for some reason I thought we could use runtime/local in pseudo-distributed mode as well. Probably need another coffee :-)



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-843:
------------------------------------

    Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Pham Tuan Minh (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887000#action_12887000 ] 

Pham Tuan Minh commented on NUTCH-843:
--------------------------------------

Hi,

I found that, after building the runtime, nutch-2.0-dev.job and the local/lib directory contain different versions of the same libraries:

ant-1.7.1.jar
ant-1.6.5.jar

servlet-api-2.5-20081211.jar
servlet-api-2.5-6.1.14.jar

Thanks,



[jira] Issue Comment Edited: (NUTCH-843) Separate the build and runtime environments

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885978#action_12885978 ] 

Chris A. Mattmann edited comment on NUTCH-843 at 7/7/10 12:31 PM:
------------------------------------------------------------------

OK, so I read this more closely. I think it would be great if we didn't have to maintain two different deployment structures for local vs. remote setups. Some comments on your local proposal:

{code}
bin/nutch           - the main nutch script
conf/                 - all relevant Nutch conf files
lib/hadoop-libs  - static Hadoop lib files - are these jar files?
lib/nutch-libs     - what are these? jar files?
plugins/             - are these the plugin directories, or plugin jar files? 
nutch.jar           -  why wouldn't this go into the lib directory?
{code}

I could envision having one simple deployment structure that looked like this:

{code}
./bin/          - nutch script goes into here
./etc/          - all Nutch configuration property files, like nutch-default.xml, nutch-site.xml
./lib/           - all shared Nutch jar files (including the nutch.jar and hadoop.jar, as well as deps). Also it would be great to be able to generate a per-plugin jars that we could include in this lib directory as well. 
./logs/        - where all log files are written to
./run/         - where PID files (if generated) are written to
{code}

Thoughts?
 



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887317#action_12887317 ] 

Julien Nioche commented on NUTCH-843:
-------------------------------------

Revision 963217: removed the extract-hadoop task from the Ant build to avoid creating Hadoop scripts in the bin dir.

@pham: your comment is not relevant to this issue. Please create a separate issue, thanks.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886318#action_12886318 ] 

Andrzej Bialecki  commented on NUTCH-843:
-----------------------------------------

runtime/local doesn't need the Hadoop scripts - by definition it uses the local filesystem and the local job runner, so the Hadoop scripts are of no use. Native libs: see NUTCH-845.
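
(For reference - a rough sketch assuming Hadoop 0.20-era property names: local mode is selected purely by configuration, which is why runtime/local needs no Hadoop daemons or scripts.)
{code}
// Sketch only, assuming Hadoop 0.20-era property names: with these two settings
// jobs run against the local filesystem via the LocalJobRunner, so no Hadoop
// daemons (jobtracker/tasktracker) or scripts are involved.
import org.apache.hadoop.conf.Configuration;

public class LocalModeSketch {
  public static Configuration localConf() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // local FS, no HDFS daemons
    conf.set("mapred.job.tracker", "local");   // LocalJobRunner, no JobTracker
    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = localConf();
    System.out.println(conf.get("fs.default.name") + " / " + conf.get("mapred.job.tracker"));
  }
}
{code}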



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886012#action_12886012 ] 

Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------

Hey Andrzej:

Wouldn't my proposed deployment structure in theory be equivalent to, say, creating a .job file as you proposed above? You can think of the proposed dir structure as an exploded version of the .job file.

Cheers,
Chris




[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885967#action_12885967 ] 

Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------

Super +1 

I've wanted to do something like this for a looong time: http://markmail.org/thread/osmfz6pknr4n4unf

;)

Let me think about the deployment structure a little bit and comment back on this issue...


[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886055#action_12886055 ] 

Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------

+1, I think this patch makes great progress! It would be good to tease out a single deployment structure in the future, but this works perfectly for now...


[jira] Issue Comment Edited: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885979#action_12885979 ] 

Andrzej Bialecki  edited comment on NUTCH-843 at 7/7/10 12:37 PM:
------------------------------------------------------------------

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and /runtime/local areas, each populated with the artifacts it needs. bin/nutch has been modified to work correctly in both cases.

(Edit) Sorry, I just read your comment - I'm wary of having a single area, because then it's again unclear which bits and pieces need to be deployed to the Hadoop master.
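
For example (the host name and target path are invented), with a separate deploy area the whole deployment boils down to copying that one directory:

{code}
# Illustrative only: everything a plain-vanilla Hadoop cluster needs from
# Nutch is bin/nutch plus the job jar, i.e. the contents of runtime/deploy.
scp -r runtime/deploy user@hadoop-master:nutch/
{code}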


[jira] Commented: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886330#action_12886330 ] 

Andrzej Bialecki  commented on NUTCH-843:
-----------------------------------------

Pseudo-distributed mode (i.e. a real JobTracker with a single TaskTracker) suffers from the same classpath issues I described above, so even in that case it's best to run jobs from a separate environment, using the /runtime/deploy artifacts.
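
To make that concrete, a session against such a cluster might look like this (the build target name, paths and seed directory are assumptions, not something this patch prescribes):

{code}
# Illustrative only.
ant runtime                      # build the runtime/ areas (target name assumed)
cd runtime/deploy
bin/nutch inject crawldb urls/   # submits to the real JobTracker; nutch.job is
                                 # shipped with the job, so no conf/ or build/
                                 # from the source tree leaks onto the classpath
{code}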


[jira] Updated: (NUTCH-843) Separate the build and runtime environments

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-843:
------------------------------------

    Attachment: NUTCH-843.patch

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and /runtime/local areas, each populated with the artifacts it needs. bin/nutch has been modified to work correctly in both cases.
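
A minimal sketch (not the attached patch - the command-to-class mapping, variable handling and layout checks below are all assumptions) of how such mode detection inside bin/nutch could work:

{code}
#!/bin/bash
# Sketch only: choose the runtime mode from what sits next to the script,
# then run the requested command.
NUTCH_HOME=$(cd "$(dirname "$0")/.." && pwd)

COMMAND=$1; shift
case "$COMMAND" in
  inject) CLASS=org.apache.nutch.crawl.Injector ;;   # class name illustrative
  *)      echo "Unknown command: $COMMAND" >&2; exit 1 ;;
esac

JOB=$(ls "$NUTCH_HOME"/*.job 2>/dev/null | head -n 1)
if [ -n "$JOB" ]; then
  # deploy layout: only bin/nutch and the job jar live here; delegate to an
  # existing Hadoop installation, which ships and unpacks the jar per task.
  exec "${HADOOP_HOME:?HADOOP_HOME must be set}/bin/hadoop" jar "$JOB" "$CLASS" "$@"
else
  # local layout: run directly, with conf/ and the Hadoop libs from lib/ on the
  # classpath; the unpacked plugins/ directory is found by the plugin system.
  CLASSPATH="$NUTCH_HOME/conf"
  for f in "$NUTCH_HOME"/lib/*.jar; do CLASSPATH="$CLASSPATH:$f"; done
  exec java -cp "$CLASSPATH" "$CLASS" "$@"
fi
{code}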
