Posted to user@nutch.apache.org by Zoltán Zvara <zo...@gmail.com> on 2017/07/18 18:50:24 UTC

Configuration is not found by Nutch when running Inject remotely

Dear Community,

I'm running the Inject job programmatically from within IntelliJ, with the target (YARN) cluster's configuration and the Nutch configuration on the classpath. In addition, the HADOOP and NUTCH CONF and HOME directories are set, pointing to distributions that I have on my local machine.

When I start the program, Nutch Inject connects to YARN 2.8.0 and the inject job starts correctly. However, during the initialization (setup) phase of the mapper (InjectMapper), an exception is thrown:

Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:71)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)

On the YARN NodeManagers, a Nutch distribution is installed with a configuration (nutch-site.xml) in which the key "plugin.folders" points to the plugin folders by an absolute path. For YARN, I've set up additional environment variables for the NMs, as follows:

<property>
<name>yarn.nodemanager.admin-env</name>
<value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf/,NUTCH_HOME=/opt/apache-nutch-1.13/</value>
</property>

In addition to this, I have set MR environment variables as well:

<property>
<name>mapred.child.env</name>
<value>NUTCH_HOME=/opt/apache-nutch-1.13,NUTCH_CONF_DIR=/opt/apache-nutch-1.13/conf</value>
</property>

I've also tried running the program with a JVM parameter, supplied with -D, to define "plugin.folders".

I'm probably missing something. How should I define "plugin.folders" when the inject job is submitted and run remotely?

Thanks for helping me out.

Zoltán

Re: Configuration is not found by Nutch when running Inject remotely

Posted by Sebastian Nagel <wa...@googlemail.com>.
> is to pass configuration parameters programmatically

That shouldn't be difficult as all Nutch tools get the configuration from
the class NutchConfiguration.

Thanks,
Sebastian

On 07/19/2017 06:40 PM, Zoltán Zvara wrote:
> Hi Sebastian,
> 
> Thanks for your tips. I switched on debugging for YARN and kept "launch_container.sh" around for a few minutes so that I could examine it. The HADOOP and NUTCH CONF and HOME directories were set correctly for the AM as well as for MR.YarnChild. CLASSPATH correctly includes the Nutch configuration, so nutch-site.xml should be picked up. As I've realized, a "job.xml" is attached to the submission from my remote computer, and it includes every parameter set in the remote JVM through a HadoopConfiguration. This means the only way to configure such a remote launch is to pass configuration parameters programmatically.
> 
> For example:
> val hConf = new HadoopConfiguration()
> hConf.set(..., ...)
> hConf.set(..., ...)
> 
> val injection = new Injection(hConf)
> injection.inject(...)
> 
> The above is just pseudocode. Sorry if there are any mistakes.
> 
> Cheers,
> Zoltán


Re: Configuration is not found by Nutch when running Inject remotely

Posted by Zoltán Zvara <zo...@gmail.com>.
Hi Sebastian,

Thanks for your tips. I switched on debugging for YARN and kept "launch_container.sh" around for a few minutes so that I could examine it. The HADOOP and NUTCH CONF and HOME directories were set correctly for the AM as well as for MR.YarnChild. CLASSPATH correctly includes the Nutch configuration, so nutch-site.xml should be picked up. As I've realized, a "job.xml" is attached to the submission from my remote computer, and it includes every parameter set in the remote JVM through a HadoopConfiguration. This means the only way to configure such a remote launch is to pass configuration parameters programmatically.

For example:
val hConf = new HadoopConfiguration()
hConf.set(..., ...)
hConf.set(..., ...)

val injection = new Injection(hConf)
injection.inject(...)

The above is just pseudocode. Sorry if there are any mistakes.
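
To confirm which parameters actually travel with the submission, one option is to look a key up in a saved copy of the "job.xml" mentioned above. The following is only a minimal sketch in plain Java, assuming the file uses the usual Hadoop configuration layout of property elements with name and value children. The class name JobXmlLookup and its lookup method are made up for illustration and are not part of Hadoop or Nutch.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop or Nutch.
public class JobXmlLookup {

    /** Returns the value of the named property, or empty if the file does not define it. */
    static Optional<String> lookup(InputStream jobXml, String key) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(jobXml);
            // Assumed layout: <property><name>..</name><value>..</value></property>
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (key.equals(text(property, "name"))) {
                    return Optional.ofNullable(text(property, "value"));
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse job.xml", e);
        }
    }

    /** Text of the first child element with the given tag, or null if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: JobXmlLookup <path to a saved job.xml>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("plugin.folders = "
                    + lookup(in, "plugin.folders").orElse("<not defined>"));
        }
    }
}
```

If the key is reported as not defined, that would be consistent with the exception in the mapper: the value was never part of the configuration that was submitted.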

Cheers,
Zoltán
On 2017-07-19 17:43:13, Sebastian Nagel <wa...@googlemail.com> wrote:
Hi Zoltán,

a warning up front: personally, I've never tried to control a Nutch launch remotely,
so I don't know of a solution.

If the property "plugin.folders" is not known, this means Nutch
also didn't read nutch-default.xml, where it is defined. I would start
by checking whether the classpath contains the configuration
folder (local mode) or the apache-nutch-*.job file (distributed mode).

Note that the environment variable NUTCH_CONF_DIR is used only by
bin/nutch - the path is added to the classpath. Loading of configuration
files (nutch-site.xml and nutch-default.xml) is delegated to Hadoop.
Similarly, NUTCH_HOME is only used to find the Nutch installation or
the job file.

To analyze the problem, try to set
log4j.logger.org.apache.hadoop=WARN
to INFO or DEBUG.

Best,
Sebastian


Re: Configuration is not found by Nutch when running Inject remotely

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Zoltán,

a warning up front: personally, I've never tried to control a Nutch launch remotely,
so I don't know of a solution.

If the property "plugin.folders" is not known, this means Nutch
also didn't read nutch-default.xml, where it is defined.  I would start
by checking whether the classpath contains the configuration
folder (local mode) or the apache-nutch-*.job file (distributed mode).

Note that the environment variable NUTCH_CONF_DIR is used only by
bin/nutch - the path is added to the classpath.  Loading of configuration
files (nutch-site.xml and nutch-default.xml) is delegated to Hadoop.
Similarly, NUTCH_HOME is only used to find the Nutch installation or
the job file.
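
As a rough way to check the classpath point, here is a minimal sketch in plain Java. Hadoop's Configuration resolves resource names such as nutch-default.xml through the class loader, so asking the class loader directly shows whether the files are visible. The class name ConfigClasspathCheck and its locate method are made up for illustration and are not part of Nutch or Hadoop. To be meaningful, the check has to run with the same classpath as the JVM that fails.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Hadoop.
public class ConfigClasspathCheck {

    /** Maps each resource name to its URL, or to null if the class loader cannot see it. */
    static Map<String, URL> locate(ClassLoader loader, String... names) {
        Map<String, URL> found = new LinkedHashMap<>();
        for (String name : names) {
            found.put(name, loader.getResource(name)); // null means "not on the classpath"
        }
        return found;
    }

    public static void main(String[] args) {
        ClassLoader loader = ConfigClasspathCheck.class.getClassLoader();
        locate(loader, "nutch-default.xml", "nutch-site.xml")
                .forEach((name, url) ->
                        System.out.println(name + " -> " + (url == null ? "NOT FOUND" : url)));
    }
}
```

If nutch-default.xml is reported as not found, the "plugin.folders is not defined" error is what one would expect, regardless of NUTCH_CONF_DIR or NUTCH_HOME.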

To analyze the problem, try to set
 log4j.logger.org.apache.hadoop=WARN
to INFO or DEBUG.

Best,
Sebastian
