Posted to user@nutch.apache.org by fa...@butterflycluster.net on 2009/08/18 07:04:20 UTC
scheduling
hi,
I have a requirement to build a simple UI for starting and stopping the
crawler, plus a scheduling mechanism (Quartz).
Has anyone attempted this before? Any lessons learned, or any
suggestions on how best to go about this?
I am just worried about issues with running the crawler inside a web
container, etc.
There is the option of Unix cron jobs, but that's the last option, unfortunately.
thanks.
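On the worry about running the crawler inside a web container: one way to sidestep it is to let the UI only schedule crawls and run each crawl as a separate process. The sketch below is a minimal illustration and not something taken from this thread. It uses the JDK's ScheduledExecutorService as a stand-in for Quartz, and the class name ScheduledCrawl, its method names, and the "bin/nutch crawl" argument list are all assumptions to adapt to your own install.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a crawl command in a child process on a schedule. */
class ScheduledCrawl {

    /** Builds the command line. The "crawl" arguments are an assumption. */
    static List<String> buildCommand(String nutchHome, String urlDir,
                                     String crawlDir, int depth) {
        return Arrays.asList(
                new File(nutchHome, "bin/nutch").getPath(),
                "crawl", urlDir, "-dir", crawlDir, "-depth", String.valueOf(depth));
    }

    /** Runs the command to completion in its own process; returns the exit code. */
    static int runOnce(List<String> command, File workingDir)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(workingDir);
        pb.inheritIO(); // crawler output goes to this process's stdout/stderr
        return pb.start().waitFor();
    }

    /** Schedules the job; call shutdownNow() on the returned executor to stop it. */
    static ScheduledExecutorService schedule(Runnable job, long initialDelay,
                                             long period, TimeUnit unit) {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the next run starts only after the previous one finished.
        executor.scheduleWithFixedDelay(job, initialDelay, period, unit);
        return executor;
    }
}
```

To wire it up, wrap runOnce in a Runnable that catches and logs exceptions (an uncaught exception cancels all further runs of a scheduled task) and pass it to schedule, for example with a period of 24 and TimeUnit.HOURS. A stop button in the UI can call shutdownNow() on the returned executor; note that this stops future runs but does not kill a crawl process that is already running.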
Re: scheduling
Posted by fa...@butterflycluster.net.
uh wow!~
never heard of this one before. thanks for this information!
Quoting rzo <rz...@gmx.de>:
> hi,
>
> you may take a look at
> http://yajsw.sourceforge.net/
>
> this is a framework for wrapping applications and for running them as
> services or daemons.
> it includes a sample script for running/scheduling nutch with tomcat
> and with solr.
>
> it comes with a system tray icon which can be used to start/stop the
> application.
>
> - ron
Re: scheduling
Posted by rzo <rz...@gmx.de>.
hi,
you may take a look at
http://yajsw.sourceforge.net/
this is a framework for wrapping applications and for running them as
services or daemons.
it includes a sample script for running/scheduling nutch with tomcat and
with solr.
it comes with a system tray icon which can be used to start/stop the
application.
- ron
Re: scheduling
Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 18, 2009, at 10:04 AM, fadzi@butterflycluster.net wrote:
>>>
>>> tried that; no joy still.
>>>
>>>
>>> are there any specifics i need to put in nutch-site.xml?
In the conf folder you have a nutch-site.xml.template.
>>>
>>>
>>> because mine is blank at the moment.
Ok. create a nutch-site.xml file in your conf folder with the
following content
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.folders</name>
    <value>src/plugin</value>
    <description>Directories where nutch plugins are located. Each
    element may be a relative or absolute path. If absolute, it is used
    as is. If relative, it is searched for on the classpath.</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Nutch-Gui-1.0</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your
    organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
</configuration>
after that start the class AdministrationApplication with the parameters
"/tmp/nutchGui" and "50060".
please delete the folder "/tmp/nutchGui" first if it already exists.
after that the gui should start and you can open the browser at
"127.0.0.1:50060/general".
but the gui is currently only in German :(.
we will translate it via i18n in the next few days.
marko
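The "delete the working folder if it already exists" step can be automated before each launch. The sketch below is only one possible way to do it with the JDK; the class name WorkingDirCleaner and its method are hypothetical and not part of nutch or the gui.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper: removes a stale gui working directory before a start. */
class WorkingDirCleaner {

    /** Recursively deletes dir if present; returns true if something was deleted. */
    static boolean deleteIfExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return true;
    }
}
```

Calling WorkingDirCleaner.deleteIfExists(Paths.get("/tmp/nutchGui")) before starting the gui mirrors the manual step above. Since this deletes everything under the given path, double-check the argument.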
Re: scheduling
Posted by fa...@butterflycluster.net.
from eclipse.
Quoting Marko Bauhardt <mb...@101tec.com>:
> Do you start the gui from eclipse or from binary package?
>
> marko
Re: scheduling
Posted by Marko Bauhardt <mb...@101tec.com>.
Do you start the gui from eclipse or from binary package?
marko
On Aug 18, 2009, at 9:57 AM, fadzi@butterflycluster.net wrote:
>
> tried that; no joy still.
>
> are there any specifics i need to put in nutch-site.xml?
>
> because mine is blank at the moment.
Re: scheduling
Posted by fa...@butterflycluster.net.
tried that; no joy still.
are there any specifics i need to put in nutch-site.xml?
because mine is blank at the moment.
Re: scheduling
Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 18, 2009, at 9:36 AM, fadzi@butterflycluster.net wrote:
> Hi Marko,
Hi
>
>
> I am trying to run the AdminApp; whats value for working directory
> and port number?
for testing i use the "/tmp" directory as the working directory.
for the http port i use 50060 or 8080. start the app on whichever port
you want.
>
>
> i used working directory as my nutch root directory.
>
> i am getting:
>
> java.lang.RuntimeException: x-point org.apache.nutch.admin.IGuiComponent not found, check your plugin folder
>     at org.apache.nutch.admin.GuiComponentExtensionContainer.getGuiComponentExtensions(GuiComponentExtensionContainer.java:49)
>     at org.apache.nutch.admin.GuiComponentExtensionContainer.getGeneralGuiComponentExtensions(GuiComponentExtensionContainer.java:36)
>     at org.apache.nutch.admin.GuiComponentDeployer.getExtensions(GuiComponentDeployer.java:126)
>     at org.apache.nutch.admin.GuiComponentDeployer.run(GuiComponentDeployer.java:102)
i think your plugin folder is not correct.
if you start the ui from eclipse, create a nutch-site.xml in your
conf folder, for example:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.folders</name>
    <value>src/plugin</value>
    <description>Directories where nutch plugins are located. Each
    element may be a relative or absolute path. If absolute, it is used
    as is. If relative, it is searched for on the classpath.</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Nutch-Gui-0.1</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your
    organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
</configuration>
you should also create a conf/regex-urlfilter.txt file.
hth
marko
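When the "x-point ... not found, check your plugin folder" error shows up, it can help to check what a plugin.folders entry actually resolves to. The sketch below follows the rule quoted in the property description (absolute paths are used as is, relative ones are looked up on the classpath) and counts subdirectories containing a plugin.xml descriptor. It is a diagnostic aid under those assumptions, not nutch's own loader; the class PluginFolderCheck and its methods are hypothetical.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical diagnostic for a plugin.folders entry. */
class PluginFolderCheck {

    /** Absolute paths are returned as is; relative ones are looked up on the classpath. */
    static File resolve(String folder) {
        File file = new File(folder);
        if (file.isAbsolute()) {
            return file;
        }
        URL url = PluginFolderCheck.class.getClassLoader().getResource(folder);
        if (url == null || !"file".equals(url.getProtocol())) {
            return null; // not on the classpath, or not a plain directory
        }
        try {
            return new File(url.toURI());
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Counts subdirectories that contain a plugin.xml descriptor. */
    static int countPlugins(File pluginDir) {
        if (pluginDir == null) {
            return 0;
        }
        File[] children = pluginDir.listFiles(File::isDirectory);
        if (children == null) {
            return 0; // path does not exist or is not a directory
        }
        int count = 0;
        for (File child : children) {
            if (new File(child, "plugin.xml").isFile()) {
                count++;
            }
        }
        return count;
    }
}
```

Run from the same classpath as the gui (for example the same Eclipse launch configuration), countPlugins(resolve("src/plugin")) returning 0 would suggest the relative folder is not visible on the classpath, in which case an absolute path in plugin.folders is worth trying.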
Re: scheduling
Posted by fa...@butterflycluster.net.
Hi Marko,
I am trying to run the AdminApp; what values should I use for the working
directory and port number?
I used my nutch root directory as the working directory.
i am getting:
java.lang.RuntimeException: x-point org.apache.nutch.admin.IGuiComponent not found, check your plugin folder
    at org.apache.nutch.admin.GuiComponentExtensionContainer.getGuiComponentExtensions(GuiComponentExtensionContainer.java:49)
    at org.apache.nutch.admin.GuiComponentExtensionContainer.getGeneralGuiComponentExtensions(GuiComponentExtensionContainer.java:36)
    at org.apache.nutch.admin.GuiComponentDeployer.getExtensions(GuiComponentDeployer.java:126)
    at org.apache.nutch.admin.GuiComponentDeployer.run(GuiComponentDeployer.java:102)
Re: scheduling
Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 18, 2009, at 7:04 AM, fadzi@butterflycluster.net wrote:
> hi,
Hi Fadzi,
>
>
> I have a requirement to build a simple UI for starting stopping the
> crawler, and also a scheduling mechanism (Quartz).
>
> Has anyone attempted this before?
We have started to implement the upgrade of the nutch gui as well.
the gui supports
+ creation of separate nutch instances
+ configuration of these instances
+ manual crawls
+ scheduled crawls
+ small statistics
+ system overview
+ url uploading
+ small simple search ui
...
you can find the project on github.com. it is a fork of the apache
nutch project from branch "tags/release-1.0".
you can check out the sources here: http://github.com/101tec/nutch/tree/nutch-gui
we plan to push a release at the end of this month.
marko
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH
Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com